This project is the final group project of Applied Statistics with R taught by Professor Kostis Christodoulou at London Business School.
My collaborators are Alessandro Angeletti, Nitya Chopra, Johanna Jeffery, and Christopher Lewis. (Hail Group 13!)
The following journey will take a long time. To have a glance at our final findings, you can download our presentation deck here.
The purpose of this report is to produce a regression model which predicts the cost of a 4 night stay, for 2 people, in an Airbnb in Beijing. In order to do this, we progressed through four stages - firstly background research on the Beijing Airbnb market followed by Exploratory Data Analysis (EDA). During EDA, we viewed, cleaned, wrangled and visualised our data (specifically variability in the different independent variables and geographical mapping). Following this, we moved onto the third stage where we tested out different combinations of regressors and their functional forms to achieve a final model with the highest possible explanatory power. The process was iterative; we evaluated different variables on the basis of their t-stat value, marginal improvement in adjusted R-squared and residual standard error.
Having decided our final model with an explanatory power of 54.4%, we generated imaginary Airbnb listings with some common base characteristics such as property type “Apartment”, room type “Private”, etc. and predicted the price for a 4 night stay. In addition, we varied certain characteristics such as location, amenities, superhost status to demonstrate how price varies quite significantly as we change the values of these regressors.
The concept of homestays first appeared in China in 2011, when the sharing economy began to spread there. After 8-9 years of market education, the sharing economy is now deeply rooted in the minds of Chinese consumers. As a result, people have become more willing to put their spare homes to use and join the ranks of homestay hosts. Meanwhile, tourism policies in China have encouraged the development of distinctive homestays since 2015 and will continue to favor the homestay industry over the next 3-5 years. In the future, the government will keep encouraging the effective use of idle personal properties and support the development of homestays.
The sharing economy has reshaped not only how people use their homes but also where they prefer to stay when travelling. Homestays have moved from being a complement to hotels to being a substitute for them, thanks to their value for money, varied modern interior design, and suitability for family trips. The current distribution of homestays is consistent with the overall development of China’s tourism industry. Homestays are concentrated in areas where the tourism industry is relatively developed, such as the East and South of China. Beijing tops the listing ranking with over 3,500 listings (2018, Homestay Investment and Investment, China Commercial Industry Research Institute).
Currently, Airbnb is one of the major players in the Chinese B2B homestay industry. It benefits from its international identity, with growth rates of 110% in outbound travel booked through Airbnb and 250% in the number of people staying in domestic listings in China. However, it also faces challenges from local competitors in the battle to localize and to meet the demand of the “sinking market” (lower-tier cities and towns), the new focal point of all industries.
First we have to download the data.
## Rows: 36,283
## Columns: 106
## $ id <dbl> 44054, 100213, 114384, 1…
## $ listing_url <chr> "https://www.airbnb.com/…
## $ scrape_id <dbl> 2.02e+13, 2.02e+13, 2.02…
## $ last_scraped <date> 2020-06-20, 2020-06-20,…
## $ name <chr> "Modern and Comfortable …
## $ summary <chr> "East Apartments offers …
## $ space <chr> "East Apartments is a we…
## $ description <chr> "East Apartments offers …
## $ experiences_offered <chr> "none", "none", "none", …
## $ neighborhood_overview <chr> "The neighborhood is a p…
## $ notes <chr> "*For long term reservat…
## $ transit <chr> "The easiest method to g…
## $ access <chr> "*Guests have access to …
## $ interaction <chr> NA, NA, "Helen和Wendy会全程为…
## $ house_rules <chr> "Registration All guests…
## $ thumbnail_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url <chr> "https://a0.muscache.com…
## $ xl_picture_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id <dbl> 192875, 527062, 533062, …
## $ host_url <chr> "https://www.airbnb.com/…
## $ host_name <chr> "East Apartments", "Joe"…
## $ host_since <date> 2010-08-06, 2011-04-22,…
## $ host_location <chr> "Beijing, Beijing, China…
## $ host_about <chr> "Hi everyone! My name i…
## $ host_response_time <chr> "within an hour", "N/A",…
## $ host_response_rate <chr> "100%", "N/A", "100%", "…
## $ host_acceptance_rate <chr> "95%", "N/A", "100%", "1…
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, FAL…
## $ host_thumbnail_url <chr> "https://a0.muscache.com…
## $ host_picture_url <chr> "https://a0.muscache.com…
## $ host_neighbourhood <chr> "Shuangjing", NA, "ITC",…
## $ host_listings_count <dbl> 5, 4, 5, 5, 1, 7, 7, 6, …
## $ host_total_listings_count <dbl> 5, 4, 5, 5, 1, 7, 7, 6, …
## $ host_verifications <chr> "['email', 'phone', 'fac…
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified <lgl> FALSE, FALSE, FALSE, FAL…
## $ street <chr> "Beijing, Beijing, China…
## $ neighbourhood <chr> "Chaoyang", NA, "ITC", "…
## $ neighbourhood_cleansed <chr> "朝阳区 / Chaoyang", "密云县 /…
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, …
## $ city <chr> "Beijing", "Beijing", "B…
## $ state <chr> "Beijing", "Beijing", "B…
## $ zipcode <dbl> 100022, 101508, NA, 1000…
## $ market <chr> "Beijing", "Other (Inter…
## $ smart_location <chr> "Beijing, China", "Beiji…
## $ country_code <chr> "CN", "CN", "CN", "CN", …
## $ country <chr> "China", "China", "China…
## $ latitude <dbl> 39.9, 40.7, 39.9, 39.9, …
## $ longitude <dbl> 116, 117, 116, 116, 116,…
## $ is_location_exact <lgl> TRUE, TRUE, TRUE, FALSE,…
## $ property_type <chr> "Serviced apartment", "G…
## $ room_type <chr> "Entire home/apt", "Priv…
## $ accommodates <dbl> 9, 2, 2, 2, 3, 2, 4, 2, …
## $ bathrooms <dbl> 2, 1, 1, 1, 1, 1, 1, 1, …
## $ bedrooms <dbl> 3, 1, 1, 1, 1, 1, 1, 1, …
## $ beds <dbl> 4, 1, 1, 1, 2, 1, 2, 1, …
## $ bed_type <chr> "Real Bed", "Real Bed", …
## $ amenities <chr> "{TV,\"Cable TV\",Intern…
## $ square_feet <dbl> 1464, NA, NA, NA, 323, N…
## $ price <chr> "$835.00", "$1,203.00", …
## $ weekly_price <chr> "$8,373.00", "$7,200.00"…
## $ monthly_price <chr> "$27,603.00", "$28,800.0…
## $ security_deposit <chr> "$708.00", "$0.00", NA, …
## $ cleaning_fee <chr> "$71.00", "$0.00", NA, "…
## $ guests_included <dbl> 6, 1, 1, 1, 2, 1, 1, 2, …
## $ extra_people <chr> "$71.00", "$0.00", "$0.0…
## $ minimum_nights <dbl> 2, 1, 1, 1, 3, 1, 1, 1, …
## $ maximum_nights <dbl> 365, 30, 730, 1125, 365,…
## $ minimum_minimum_nights <dbl> 2, 1, 1, 1, 3, 1, 1, 1, …
## $ maximum_minimum_nights <dbl> 2, 1, 1, 1, 3, 1, 1, 1, …
## $ minimum_maximum_nights <dbl> 365, 30, 730, 1125, 365,…
## $ maximum_maximum_nights <dbl> 365, 30, 730, 1125, 365,…
## $ minimum_nights_avg_ntm <dbl> 2, 1, 1, 1, 3, 1, 1, 1, …
## $ maximum_nights_avg_ntm <dbl> 365, 30, 730, 1125, 365,…
## $ calendar_updated <chr> "5 months ago", "27 mont…
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30 <dbl> 19, 0, 19, 19, 19, 2, 0,…
## $ availability_60 <dbl> 49, 0, 49, 49, 49, 2, 0,…
## $ availability_90 <dbl> 79, 0, 79, 79, 79, 2, 0,…
## $ availability_365 <dbl> 354, 0, 354, 354, 169, 2…
## $ calendar_last_scraped <date> 2020-06-20, 2020-06-20,…
## $ number_of_reviews <dbl> 99, 2, 66, 10, 290, 26, …
## $ number_of_reviews_ltm <dbl> 7, 0, 1, 1, 22, 0, 2, 0,…
## $ first_review <date> 2010-08-25, 2017-08-27,…
## $ last_review <date> 2020-01-06, 2017-10-08,…
## $ review_scores_rating <dbl> 91, 100, 93, 98, 97, 77,…
## $ review_scores_accuracy <dbl> 9, 10, 10, 9, 10, 8, 8, …
## $ review_scores_cleanliness <dbl> 8, 9, 9, 9, 10, 7, 7, 8,…
## $ review_scores_checkin <dbl> 10, 10, 10, 9, 10, 9, 9,…
## $ review_scores_communication <dbl> 10, 10, 10, 10, 10, 9, 9…
## $ review_scores_location <dbl> 10, 9, 10, 10, 10, 9, 9,…
## $ review_scores_value <dbl> 9, 9, 10, 9, 10, 8, 9, 8…
## $ requires_license <lgl> FALSE, FALSE, FALSE, FAL…
## $ license <chr> NA, NA, "Exempt", "Exemp…
## $ jurisdiction_names <lgl> NA, NA, NA, NA, NA, NA, …
## $ instant_bookable <lgl> FALSE, TRUE, TRUE, TRUE,…
## $ is_business_travel_ready <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy <chr> "strict_14_with_grace_pe…
## $ require_guest_profile_picture <lgl> FALSE, FALSE, FALSE, FAL…
## $ require_guest_phone_verification <lgl> FALSE, FALSE, FALSE, FAL…
## $ calculated_host_listings_count <dbl> 5, 4, 5, 5, 1, 5, 5, 6, …
## $ calculated_host_listings_count_entire_homes <dbl> 5, 0, 5, 5, 1, 5, 5, 5, …
## $ calculated_host_listings_count_private_rooms <dbl> 0, 3, 0, 0, 0, 0, 0, 1, …
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 1, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month <dbl> 0.83, 0.06, 0.73, 0.11, …
From this output we can see that we have 36,283 listings (rows) described by 106 variables (columns).
Since this is a large data set with a lot going on, we will first select the variables we’re interested in. Next, we will reformat them to ensure that there are no special characters such as ‘$’ or ‘%’.
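To make the reformatting step concrete, the small sketch below applies `parse_number()` from readr to a few values copied from the glimpse output above. It is only an illustration of the conversion; the names `raw_prices` and `raw_rates` are ours and do not appear in the analysis.

```r
library(readr)

# Values copied from the glimpse output above
raw_prices <- c("$835.00", "$1,203.00", "$8,373.00")
raw_rates  <- c("95%", "N/A", "100%")

# parse_number() drops the currency symbol and the thousands separator
prices <- parse_number(raw_prices)

# "N/A" is declared as a missing value; the percentage is rescaled to a 0-1 proportion
rates <- parse_number(raw_rates, na = c("", "NA", "N/A")) / 100

prices
rates
```

Declaring "N/A" through the `na` argument is a choice made for this sketch: it turns those entries into `NA` explicitly rather than leaving them to be flagged as parsing problems.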
listings <- data %>%
#Lets pick the variables we need
select(c(price,
cleaning_fee,
extra_people,
room_type,
property_type,
number_of_reviews,
review_scores_rating,
longitude,
latitude,
neighbourhood,
minimum_nights,
guests_included,
bathrooms,
bedrooms,
beds,
accommodates,
host_is_superhost,
neighbourhood_cleansed,
cancellation_policy,
listing_url,
is_location_exact,
security_deposit,
review_scores_cleanliness,
instant_bookable,
amenities,
calculated_host_listings_count,
reviews_per_month,
host_acceptance_rate
)
) %>%
# Removing dollar signs and changing into numerical variables
mutate(
# Changing Price from chr to dbl
price = parse_number(price),
# Changing Cleaning Fee from chr to dbl
cleaning_fee = parse_number(cleaning_fee),
# Changing Extra People fee from chr to dbl
extra_people = parse_number(extra_people),
# Changing Security Deposit from chr to dbl
security_deposit = parse_number(security_deposit),
# Changing host acceptance rate
host_acceptance_rate = parse_number(host_acceptance_rate)/100
)
Now that we have all the variables in the required format, we wish to check the quality of our data by investigating some of the variables’ key characteristics.
# Check which variables have lots of missing values (NA's)
listings %>%
skim() %>%
kbl() %>%
kable_styling()

| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | logical.mean | logical.count | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | room_type | 0 | 1.000 | 11 | 15 | 0 | 3 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | property_type | 0 | 1.000 | 3 | 22 | 0 | 45 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood | 13370 | 0.632 | 3 | 36 | 0 | 61 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood_cleansed | 0 | 1.000 | 3 | 16 | 0 | 16 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | cancellation_policy | 0 | 1.000 | 8 | 27 | 0 | 3 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | listing_url | 0 | 1.000 | 34 | 37 | 0 | 36283 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | amenities | 0 | 1.000 | 2 | 1917 | 0 | 28222 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_is_superhost | 1 | 1.000 | NA | NA | NA | NA | NA | 0.264 | FAL: 26711, TRU: 9571 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | is_location_exact | 0 | 1.000 | NA | NA | NA | NA | NA | 0.565 | TRU: 20497, FAL: 15786 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | instant_bookable | 0 | 1.000 | NA | NA | NA | NA | NA | 0.643 | TRU: 23333, FAL: 12950 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | price | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | 726.046 | 1861.040 | 0.00 | 255.00 | 396.00 | 651.00 | 70723.0 | ▇▁▁▁▁ |
| numeric | cleaning_fee | 23123 | 0.363 | NA | NA | NA | NA | NA | NA | NA | 60.943 | 218.669 | 0.00 | 0.00 | 40.00 | 70.00 | 10000.0 | ▇▁▁▁▁ |
| numeric | extra_people | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | 20.474 | 79.101 | 0.00 | 0.00 | 0.00 | 0.00 | 2118.0 | ▇▁▁▁▁ |
| numeric | number_of_reviews | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | 6.752 | 16.834 | 0.00 | 0.00 | 1.00 | 5.00 | 344.0 | ▇▁▁▁▁ |
| numeric | review_scores_rating | 16270 | 0.552 | NA | NA | NA | NA | NA | NA | NA | 94.789 | 10.836 | 20.00 | 94.00 | 100.00 | 100.00 | 100.0 | ▁▁▁▁▇ |
| numeric | longitude | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | 116.442 | 0.258 | 115.47 | 116.34 | 116.43 | 116.50 | 117.5 | ▁▁▇▁▁ |
| numeric | latitude | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | 40.022 | 0.235 | 39.46 | 39.90 | 39.94 | 40.05 | 41.0 | ▁▇▁▂▁ |
| numeric | minimum_nights | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | 4.308 | 28.307 | 1.00 | 1.00 | 1.00 | 1.00 | 1086.0 | ▇▁▁▁▁ |
| numeric | guests_included | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | 1.365 | 1.257 | 1.00 | 1.00 | 1.00 | 1.00 | 16.0 | ▇▁▁▁▁ |
| numeric | bathrooms | 21 | 0.999 | NA | NA | NA | NA | NA | NA | NA | 1.424 | 1.375 | 0.00 | 1.00 | 1.00 | 1.50 | 101.5 | ▇▁▁▁▁ |
| numeric | bedrooms | 142 | 0.996 | NA | NA | NA | NA | NA | NA | NA | 1.663 | 1.480 | 0.00 | 1.00 | 1.00 | 2.00 | 50.0 | ▇▁▁▁▁ |
| numeric | beds | 380 | 0.990 | NA | NA | NA | NA | NA | NA | NA | 2.242 | 2.754 | 0.00 | 1.00 | 1.00 | 2.00 | 115.0 | ▇▁▁▁▁ |
| numeric | accommodates | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | 3.742 | 3.090 | 1.00 | 2.00 | 2.00 | 4.00 | 18.0 | ▇▁▁▁▁ |
| numeric | security_deposit | 23793 | 0.344 | NA | NA | NA | NA | NA | NA | NA | 655.045 | 2337.306 | 0.00 | 0.00 | 200.00 | 700.00 | 35362.0 | ▇▁▁▁▁ |
| numeric | review_scores_cleanliness | 16272 | 0.552 | NA | NA | NA | NA | NA | NA | NA | 9.518 | 1.065 | 2.00 | 9.00 | 10.00 | 10.00 | 10.0 | ▁▁▁▁▇ |
| numeric | calculated_host_listings_count | 0 | 1.000 | NA | NA | NA | NA | NA | NA | NA | 9.543 | 13.636 | 1.00 | 2.00 | 5.00 | 11.00 | 89.0 | ▇▁▁▁▁ |
| numeric | reviews_per_month | 15644 | 0.569 | NA | NA | NA | NA | NA | NA | NA | 0.649 | 0.850 | 0.01 | 0.14 | 0.31 | 0.81 | 22.9 | ▇▁▁▁▁ |
| numeric | host_acceptance_rate | 6280 | 0.827 | NA | NA | NA | NA | NA | NA | NA | 0.922 | 0.189 | 0.00 | 0.95 | 1.00 | 1.00 | 1.0 | ▁▁▁▁▇ |
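The `n_missing` and `complete_rate` columns in the table above are easy to reproduce by hand, which is a useful sanity check on the skim output. The sketch below uses base R on a small made-up data frame (`toy` is purely illustrative, not the Airbnb data) and then repeats the arithmetic with the `cleaning_fee` figures reported in the table.

```r
# Toy data frame, for illustration only (not the Airbnb data)
toy <- data.frame(
  cleaning_fee = c(71, 0, NA, 30, NA),
  price        = c(835, 1203, 602, 602, 411)
)

# Number of missing values per column and the share of non-missing values
n_missing     <- colSums(is.na(toy))
complete_rate <- 1 - n_missing / nrow(toy)

n_missing
complete_rate

# Same arithmetic with the cleaning_fee figures from the table:
# 23,123 missing values out of 36,283 listings
cleaning_fee_complete <- 1 - 23123 / 36283
round(cleaning_fee_complete, 3)
```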
Surprises in the data
cleaning_fee has an extremely high number of missing (NA) values. This is most likely because some hosts include the cleaning fee within the price, so that the listing appears to carry no “add-on” costs when you’re looking to book. Interestingly, however, some properties do list a cleaning fee, and it varies widely, ranging from $0 to $10,000! Therefore, we will have to look at how the cleaning fee variable correlates with other characteristics of the listings (such as the flat size, number of bedrooms, number of guests, etc.).
In this next section of code, we tackle the ‘NA’ values in cleaning_fee, security_deposit and reviews_per_month and also transform the amenities variable into a format we can use for our model.
data_cleaned <- listings %>%
# In order to handle the high volume of NA's in cleaning_fee, we will change these values to a 0
mutate(
cleaning_fee = case_when(
is.na(cleaning_fee) ~ 0,
TRUE ~ cleaning_fee
),
# We apply the same logic to the security_deposit variable
security_deposit = case_when(
is.na(security_deposit) ~ 0,
TRUE ~ security_deposit
),
# and again to the reviews_per_month variable
reviews_per_month = case_when(
is.na(reviews_per_month) ~ 0,
TRUE ~ reviews_per_month
),
# Creating a new variable 'wifi' which returns as TRUE when wifi is detected in the variable 'amenities'
wifi = case_when(
str_detect(amenities, "Wifi") ~ TRUE,
# Allowing for differences in spelling and upper/lowercases
str_detect(amenities, "wifi") ~ TRUE,
TRUE ~ FALSE
),
# Process repeated again to create a 'breakfast' variable
breakfast = case_when(
str_detect(amenities, "Breakfast") ~ TRUE,
str_detect(amenities, "breakfast") ~ TRUE,
TRUE ~ FALSE
),
# We count the number of amenities at each property by splitting the string on "," (commas) and counting the resulting pieces.
services = lengths(strsplit(amenities, ",")),
host_acceptance_rate = case_when(
is.na(host_acceptance_rate) ~ 0,
TRUE ~ host_acceptance_rate
)
)
# lets examine wifi and breakfast columns
data_cleaned %>%
select(c(price, wifi, breakfast))
## # A tibble: 36,283 x 3
## price wifi breakfast
## <dbl> <lgl> <lgl>
## 1 835 TRUE FALSE
## 2 1203 TRUE TRUE
## 3 602 TRUE FALSE
## 4 602 TRUE FALSE
## 5 411 TRUE TRUE
## 6 552 TRUE FALSE
## 7 601 TRUE FALSE
## 8 403 TRUE FALSE
## 9 743 TRUE FALSE
## 10 418 TRUE FALSE
## # … with 36,273 more rows
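The pairs of `str_detect()` calls used for the wifi and breakfast flags can be folded into a single case-insensitive match, and the amenity count can be taken from the same string. The sketch below shows one possible way to do this with stringr, on a few hand-written strings in the same brace-delimited format as the amenities column; `example_amenities` is illustrative and not taken from the data.

```r
library(stringr)

# Hand-written examples in the same format as the amenities column
example_amenities <- c(
  "{TV,\"Cable TV\",Internet,Wifi,Breakfast}",
  "{wifi,Kitchen}",
  "{}"
)

# One case-insensitive pattern replaces the separate "Wifi" / "wifi" checks
wifi      <- str_detect(example_amenities, regex("wifi", ignore_case = TRUE))
breakfast <- str_detect(example_amenities, regex("breakfast", ignore_case = TRUE))

# Strip the surrounding braces, then count comma-separated entries;
# an empty set "{}" is counted as zero amenities
inner    <- str_sub(example_amenities, 2, -2)
services <- ifelse(inner == "", 0L, str_count(inner, ",") + 1L)

data.frame(wifi, breakfast, services)
```

One difference from the pipeline above is worth noting: splitting "{}" on commas yields a single piece, so the original count treats an empty amenity set as 1, whereas this sketch treats it as 0. Which convention is preferable is a modelling choice.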
# Let's skim the cleaning_fee variable to see if we have succeeded
data_cleaned %>%
skim(cleaning_fee) %>%
# the kable package is used to format the resulting tables in a more visually appealing way
kbl() %>%
kable_styling()

| skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| numeric | cleaning_fee | 0 | 1 | 22.1 | 135 | 0 | 0 | 0 | 0 | 10000 | ▇▁▁▁▁ |
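The fall in the mean cleaning fee from about 60.9 to 22.1 is a direct consequence of replacing missing values with 0, and it can be checked from the two skim tables alone. A short sketch of that arithmetic, using only figures reported above (the variable names are ours):

```r
n_total       <- 36283   # number of listings
n_missing     <- 23123   # listings with no cleaning_fee before cleaning
mean_observed <- 60.943  # mean cleaning fee among listings that report one

# After zero-filling, the total of all fees is unchanged
# but it is now averaged over every listing
mean_after_fill <- mean_observed * (n_total - n_missing) / n_total
round(mean_after_fill, 1)
```

This matters for interpretation: the zero-filled variable treats "fee not reported" as "no fee", which is the assumption made in the cleaning step.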
Fun facts!
Taking another look at the data, we can note the following additional stats:
- 56m^2;
- $70,723 per night;
- 101.5 bathrooms;
- 40% of flats are apartments.
Summary statistics can only take us so far in understanding the data, so it is important to also visualise our variables.
# Using patchwork to create a visualization of density for all numerical variables
p1 <- ggplot(data = data_cleaned, aes(x = price)) +
geom_density() +
theme_bw() +
labs(title = "Variability of Price",
subtitle="Difficulty interpreting density due to outliers in price") +
theme(
plot.title = element_text(face="bold")
)
# Before creating plots for all other numerical variables, let's check the readability
p1
# The x-axis range of some variables makes it difficult to get a full picture
# of the variability in the variables
p1a <- ggplot(data = data_cleaned, aes(x = price)) +
geom_density() +
#Here we add a limit to the x-axis, where the maximum value is 10000.
#We add this to most of the plots, where necessary
xlim(0, 10000) +
theme_bw() +
labs(title = "Price $", x = "", y = "") +
theme(plot.title = element_text(size = 8))
p2a <- ggplot(data = data_cleaned, aes(x = cleaning_fee)) +
geom_histogram() +
xlim(0, 300) +
theme_bw() +
labs(title = "Cleaning Fee $", x = "", y = "")+
theme(plot.title = element_text(size = 8))
p5a <- ggplot(data = data_cleaned, aes(x = guests_included)) +
geom_histogram() +
xlim(0, 8) +
theme_bw()+
labs(title = "Guests Included", x = "", y = "")+
theme(plot.title = element_text(size = 8))
p3a <- ggplot(data = data_cleaned, aes(x = extra_people)) +
geom_density() +
xlim(0, 400) +
theme_bw()+
labs(title = "Extra People Fee $", x = "", y = "")+
theme(plot.title = element_text(size = 8))
p10a <- ggplot(data = data_cleaned, aes(x = number_of_reviews)) +
geom_histogram() +
xlim(0, 100) +
theme_bw()+
labs(title = "No. of Reviews", x = "", y = "")+
theme(plot.title = element_text(size = 8))
p11a <- ggplot(data = data_cleaned, aes(x = review_scores_rating)) +
geom_histogram() +
xlim(0, 100) +
theme_bw() +
labs(title = "Review Ratings", x = "", y = "")+
theme(plot.title = element_text(size = 8))
p9a <- ggplot(data = data_cleaned, aes(x = minimum_nights)) +
geom_histogram() +
xlim(0, 150) +
theme_bw() +
labs(title = "Minimum Night Stay", x = "", y = "")+
theme(plot.title = element_text(size = 8))
p4a <- ggplot(data = data_cleaned, aes(x = accommodates)) +
geom_histogram() +
theme_bw()+
labs(title = "No. Accommodated", x = "", y = "")+
theme(plot.title = element_text(size = 8))
p7a <- ggplot(data = data_cleaned, aes(x = beds)) +
geom_histogram() +
xlim(0, 20) +
theme_bw()+
labs(title = "No. of Beds", x = "", y = "")+
theme(plot.title = element_text(size = 8))
p8a <- ggplot(data = data_cleaned, aes(x = bathrooms)) +
geom_histogram() +
xlim(0, 20) +
theme_bw()+
labs(title = "No. of Bathrooms", x = "", y = "")+
theme(plot.title = element_text(size = 8))
p6a <- ggplot(data = data_cleaned, aes(x = bedrooms)) +
geom_histogram() +
xlim(0, 15) +
theme_bw()+
labs(title = "No. of Bedrooms", x = "", y = "")+
theme(plot.title = element_text(size = 8))
p1a + p2a + p3a + p4a + p5a + p6a + p7a + p8a + p9a + p10a + p11a +
plot_annotation(title = "Variability in Numerical Variables",
subtitle = "Majority of numerical variables are highly right-skewed")
# using ggpairs to plot a correlation matrix
data_cleaned %>%
select(c(price, cleaning_fee, guests_included,
extra_people, number_of_reviews, review_scores_rating,
minimum_nights, accommodates, beds, bathrooms, bedrooms, security_deposit)
) %>%
ggpairs()
Lots of data, lots of noise…
Having had some time to look through this information, we found some interesting correlations between variables.
Some of the character variables have lots of different values, e.g. property_type. Here we look at cleaning this to make it more manageable.
data_cleaned %>%
# Counting the frequency of property types
count(property_type) %>%
# Arranging them into descending order by frequency
arrange(desc(n))
## # A tibble: 45 x 2
## property_type n
## <chr> <int>
## 1 Apartment 14428
## 2 Condominium 4761
## 3 House 4129
## 4 Loft 2960
## 5 Serviced apartment 2189
## 6 Farm stay 1330
## 7 Villa 1222
## 8 Bungalow 985
## 9 Cottage 596
## 10 Townhouse 513
## # … with 35 more rows
Wait a second…
It is interesting to note that some of the listings don’t make a whole lot of sense. How is it that one of the world’s largest metropolises has a Farm stay or a Bungalow available? When checking the listings, we find that, more often than not, the owners are not being 100% transparent. The most obvious lie was a listing claiming to be an igloo: it calls itself an igloo because the cooling power of its AC is supposedly incredible.
Anyhow,
We now classify different types of properties into 5 groups - the 4 most prominent ones and remaining smaller categories labeled as ‘Other’.
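The same grouping rule can also be derived from the data rather than typed by hand. The sketch below, in base R on a small illustrative vector (`property_type_example` is made up), keeps the n most frequent values and relabels the rest as "Other"; `lump_top_n` is a hypothetical helper name, not part of the analysis.

```r
# Keep the n most frequent values of x and relabel everything else as "Other"
lump_top_n <- function(x, n) {
  counts <- sort(table(x), decreasing = TRUE)
  keep   <- names(counts)[seq_len(min(n, length(counts)))]
  ifelse(x %in% keep, x, "Other")
}

# Made-up example vector
property_type_example <- c(
  "Apartment", "Apartment", "Apartment", "Condominium", "Condominium",
  "House", "House", "Loft", "Loft", "Villa", "Igloo"
)

simplified <- lump_top_n(property_type_example, n = 4)
table(simplified)
```

Listing the four types explicitly, as done in the pipeline, has the advantage of being easier to read and stable if the data are refreshed; the helper is only useful when the top categories are not known in advance.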
cleaning <- data_cleaned %>%
# creating a new variable 'prop_type_simplified' that groups property types
#into one of 5 categories. For example, "Boutique hotel" will now come under "Other"
mutate(prop_type_simplified = case_when(
# Here we specify that if property_type is equal to the top 4 types,
#then we pass through the property_type value
property_type %in% c("Apartment","Condominium", "House","Loft") ~ property_type,
# This specifies that if the property_type value doesn't meet this criteria,
#the new variable will equal 'Other'
TRUE ~ "Other"
))
Now that our categorical variables are cleaned, we can inspect their variability as we did with the numerical variables, this time using bar plots. We plot prop_type_simplified, room_type, host_is_superhost and cancellation_policy to analyze their distributions.
# Simple ggplot code specifying x variable, visualisation type and theme
# using patchwork to plot distribution of different variables
p12 <- ggplot(data = cleaning, aes(x = prop_type_simplified)) +
geom_bar() +
theme_bw() +
labs(title = "Property Type (Simplified)", x = "", y = "")
p13 <- ggplot(data = cleaning, aes(x = room_type)) +
geom_bar() +
theme_bw() +
labs(title = "Room Type", x = "", y = "")
p14 <- ggplot(data = cleaning, aes(x = host_is_superhost)) +
geom_bar() +
theme_bw() +
labs(title = "Superhost", x = "", y = "")
p15 <- ggplot(data = cleaning, aes(x = cancellation_policy)) +
geom_bar() +
theme_bw() +
labs(title = "Cancellation Policy", x = "", y = "")
# Using patchwork to create a clean grid of the bar plots
p12 + p13 + p14 + p15 +
plot_annotation(title = "Apartments are the most common listing in Beijing",
subtitle = "Over half of listings have a flexible cancellation policy,
and 2/3rds list the entire property")
What does this show us?
In the Other category there is an elevated level of variability, allowing customers to filter through many unique types of properties.
# Here we can explore the correlation between our numerical variables
data_numerical <- data_cleaned %>%
#First we select the variables we want to plot against each other
select(c(price,
cleaning_fee,
guests_included,
extra_people,
number_of_reviews,
review_scores_rating,
minimum_nights,
accommodates,
beds,
bathrooms,
bedrooms))
# data_numerical
#Next we use a corrplot to visualise the correlations between variables
M <- cor(data_numerical, use = "pairwise.complete.obs")
col<- colorRampPalette(c("blue", "white", "purple"))(7)
corrplot(M, method = "color", col = col,
type = "upper", order = "hclust",
addCoef.col = "black",
tl.col="black", tl.srt=45,
number.cex = 0.7,
tl.cex = 0.7,
diag=FALSE
)
Notable correlations with price are:
- the extra_people fee
- guests_included
As we are looking at data over a geographical region, it can be helpful to see the geospatial spread of the Airbnb listings. Here we use the leaflet package to map our longitude and latitude data onto a map. Note that the bubbles are colored according to listing density.
# Using the leaflet package
leaflet(data = filter(cleaning, minimum_nights <= 4)) %>%
# Adding the map to lie beneath the data points
addProviderTiles("OpenStreetMap.Mapnik") %>%
# Adding our listing data as points on the map
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
# Adding a function, so that when you click on a data point,
#the Airbnb URL for the listing appears
popup = ~listing_url,
# Adding a label function, so when you hover over a data point,
#the property type shows
label = ~property_type,
# Due to the high number of markers on the map, we add a cluster
#option so that it is easier to interpret
clusterOptions = markerClusterOptions())
In order to run a regression model, we will transform our price data into an approximately ‘normal’ distribution.
# Before transforming price, let's first see how minimum_nights is distributed
# (restricted to listings with a minimum stay of at most 50 nights)
cleaning %>%
filter(minimum_nights <=50) %>%
ggplot() +
geom_histogram(aes(x = minimum_nights))
As we are looking to model the price of an Airbnb in Beijing for travel/tourism, we should look into the minimum_nights variable. This variable states the minimum number of nights for which you are able to book the listing.
# Visualise the frequency of minimum nights
# arranging listings by minimum_nights
cleaning %>%
count(minimum_nights) %>%
# Arrange in descending order of frequency
arrange(desc(n))
## # A tibble: 66 x 2
## minimum_nights n
## <dbl> <int>
## 1 1 30216
## 2 2 2178
## 3 3 1024
## 4 30 819
## 5 7 369
## 6 5 368
## 7 15 316
## 8 90 175
## 9 10 161
## 10 60 89
## # … with 56 more rows
# calculating summary statistics for the distribution of minimum_nights
favstats(data = cleaning , ~ minimum_nights) %>%
kbl() %>%
kable_styling()

| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1086 | 4.31 | 28.3 | 36283 | 0 |
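From the above, we can infer the following:
- the most common values of minimum_nights are 1, 2 and 3 nights, with a one-night minimum accounting for 30,216 of the 36,283 listings;
- among the longer minimum stays, values such as 7, 15, 30, 60 and 90 nights stand out, which may reflect weekly or monthly lets rather than short tourist stays;
- since we are modelling a 4-night stay, we keep only listings with minimum_nights of at most 4.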
neighbourhoodring <- vroom::vroom("neighbourhoodring.csv")
regression_data <- cleaning %>%
# filter for minimum nights at most 4
filter(minimum_nights<=4) %>%
left_join(., neighbourhoodring, by = "neighbourhood", copy = TRUE) %>%
# New variable that computes the price of 2 people
#booking an Airbnb for 4 nights
# Note: extra_people charge per 1 extra person applied
#per night when no. of guests > guests_included
mutate(price_for_4_notlog = case_when(
guests_included < 2 ~ cleaning_fee + (4 * (price + extra_people)),
TRUE ~ cleaning_fee + (4 * price)
),
price_4_nights = log(price_for_4_notlog + 0.9),
#New variable that classifies neighborhood into 5 areas according
#to Beijing's geographical characteristic
#The 5 areas are Ring 2-6
neighbourhood_simplified = case_when(
Ring == "2" ~ "Ring 2",
Ring == "3" ~ "Ring 3",
Ring == "4" ~ "Ring 4",
Ring == "5" ~ "Ring 5",
TRUE ~ "Ring 6"
)
) %>%
subset(., select = -Ring)
regression_data
## # A tibble: 33,497 x 35
## price cleaning_fee extra_people room_type property_type number_of_revie…
## <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 835 71 71 Entire h… Serviced apa… 99
## 2 1203 0 0 Private … Guest suite 2
## 3 602 0 0 Entire h… Apartment 66
## 4 602 30 0 Entire h… Apartment 10
## 5 411 71 106 Entire h… House 290
## 6 552 0 0 Entire h… Apartment 26
## 7 601 0 0 Entire h… Apartment 39
## 8 403 0 64 Entire h… Apartment 30
## 9 743 283 0 Entire h… Apartment 117
## 10 418 35 80 Entire h… Apartment 3
## # … with 33,487 more rows, and 29 more variables: review_scores_rating <dbl>,
## # longitude <dbl>, latitude <dbl>, neighbourhood <chr>, minimum_nights <dbl>,
## # guests_included <dbl>, bathrooms <dbl>, bedrooms <dbl>, beds <dbl>,
## # accommodates <dbl>, host_is_superhost <lgl>, neighbourhood_cleansed <chr>,
## # cancellation_policy <chr>, listing_url <chr>, is_location_exact <lgl>,
## # security_deposit <dbl>, review_scores_cleanliness <dbl>,
## # instant_bookable <lgl>, amenities <chr>,
## # calculated_host_listings_count <dbl>, reviews_per_month <dbl>,
## # host_acceptance_rate <dbl>, wifi <lgl>, breakfast <lgl>, services <int>,
## # prop_type_simplified <chr>, price_for_4_notlog <dbl>, price_4_nights <dbl>,
## # neighbourhood_simplified <chr>
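The four-night price rule in the `mutate()` call above can be checked by hand on individual listings. The sketch below wraps the same rule in a small function (`price_for_two_guests` is a hypothetical name) and applies it to the first two rows of the printed output, taking their `guests_included` values (6 and 1) from the glimpse output earlier in the report.

```r
# Cost of a stay for two guests: the nightly extra_people fee applies
# only when fewer than two guests are included in the base price
price_for_two_guests <- function(price, cleaning_fee, extra_people,
                                 guests_included, nights = 4) {
  nightly <- ifelse(guests_included < 2, price + extra_people, price)
  cleaning_fee + nights * nightly
}

# Rows 1 and 2 of regression_data as printed above
total <- price_for_two_guests(
  price           = c(835, 1203),
  cleaning_fee    = c(71, 0),
  extra_people    = c(71, 0),
  guests_included = c(6, 1)
)
total

# The modelled response is the log of this total, with the same 0.9 offset
log(total + 0.9)
```

For the first listing this is 71 + 4 × 835 = 3,411, since six guests are already included; for the second it is 4 × (1,203 + 0) = 4,812, because only one guest is included and the extra-person fee happens to be zero.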
# ggplot for price of four nights
ggplot(data = regression_data, aes(x = price_for_4_notlog)) +
geom_histogram() +
xlim(0, 40000) +
labs(
title = "Distribution of Price for 4 Nights",
x = "Price for 4 Nights",
y = "Count"
) +
# ggplot for log of price of four nights
ggplot(data = regression_data, aes(x = price_4_nights)) +
geom_density() +
labs(
title = "Density of the Logged Price for 4 Nights",
x = "Log(Price for 4 Nights)",
y = "Density"
)
We complete a log transformation to change the interpretation of the coefficients from a unit change to a percentage change.
Why Does One Log Price?
As you can see from the Distribution of Price for 4 Nights, the variable price_for_4_notlog is heavily right-skewed. For our regression analysis, we want a response variable whose distribution is closer to normal. To achieve this, we take the log of the price, as shown in the Density of the Logged Price for 4 Nights.
# model 1 with a few variables - reviews and property types
model1 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating,
regression_data)
model1 %>%
tidy(conf.int=TRUE)
## # A tibble: 7 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.91 0.0472 146. 0. 6.81 7.00
## 2 prop_type_simplifie… -0.0609 0.0160 -3.80 1.45e- 4 -0.0923 -0.0295
## 3 prop_type_simplifie… 0.202 0.0179 11.3 1.78e- 29 0.167 0.237
## 4 prop_type_simplifie… 0.106 0.0194 5.46 4.95e- 8 0.0679 0.144
## 5 prop_type_simplifie… 0.453 0.0144 31.5 1.09e-212 0.425 0.482
## 6 number_of_reviews -0.00207 0.000259 -8.00 1.32e- 15 -0.00258 -0.00156
## 7 review_scores_rating 0.00453 0.000493 9.18 4.94e- 20 0.00356 0.00550
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.071 | 0.07 | 0.74 | 236 | 0 | 6 | -20848 | 41713 | 41775 | 10218 | 18636 | 18643 |
Here, property type is a categorical variable: it has five categories and therefore enters the regression model as 4 dummy variables, with ‘Apartment’ as the baseline. For example, the intercept term for ‘Apartment’ would just be ~ 6.91. For ‘House’, prop_type_simplifiedHouse = 1 (and the Condominium, Loft and Other dummies are 0), so the intercept term would be 6.91 + 0.20 ~ 7.11. For ‘Other’, prop_type_simplifiedOther = 1 (and the Condominium, House and Loft dummies are 0), so the intercept term would be 6.91 + 0.45 ~ 7.36. Therefore, relative to apartments, price_4_nights will be higher for houses, lofts and other property types but lower for condominiums.
Note: our Y variable is in logs, so each coefficient multiplied by 100 is approximately the percentage change in price_4_nights per unit change in the corresponding X variable. The approximation is good for small coefficients and becomes less accurate for large ones.
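As a worked illustration of this reading, the sketch below compares the rule-of-thumb percentage (100 times the coefficient) with the exact change implied by a log-linear model, 100 × (exp(b) - 1). The three coefficients are copied from the model 1 output; the vector names are ours.

```r
# Coefficients copied from the model 1 output
b <- c(review_scores_rating = 0.00453, House = 0.202, Other = 0.453)

approx_pct <- 100 * b              # rule-of-thumb reading
exact_pct  <- 100 * (exp(b) - 1)   # exact change implied by a log-linear model

round(cbind(approx_pct, exact_pct), 2)
```

For a small coefficient such as review_scores_rating the two columns nearly coincide, while for larger coefficients the gap widens, so the exact formula is the safer one to quote.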
Other variables such as number_of_reviews and review_scores_rating are statistically significant and help explain the variation in price_4_nights. A point worth noting is that a higher number_of_reviews does not lead to a higher cost for 4 nights (its coefficient is negative), possibly because the reviews are not necessarily good ones. On the other hand, review_scores_rating has a positive effect on price_4_nights, which means that properties with a higher rating tend to be pricier.
# model 2 = model 1 + room type
model2 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type,
regression_data)
model2 %>%
tidy(conf.int=TRUE)
## # A tibble: 9 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.12 0.0411 173. 0. 7.04 7.20
## 2 prop_type_simplified… -0.0337 0.0139 -2.42 1.56e- 2 -0.0610 -0.00637
## 3 prop_type_simplified… 0.275 0.0156 17.7 3.32e-69 0.245 0.306
## 4 prop_type_simplified… -0.0265 0.0170 -1.56 1.18e- 1 -0.0598 0.00673
## 5 prop_type_simplified… 0.528 0.0126 42.0 0. 0.503 0.552
## 6 number_of_reviews -0.00140 0.000225 -6.20 5.71e-10 -0.00184 -0.000954
## 7 review_scores_rating 0.00485 0.000429 11.3 1.64e-29 0.00401 0.00569
## 8 room_typePrivate room -0.668 0.0105 -63.7 0. -0.689 -0.648
## 9 room_typeShared room -1.21 0.0224 -54.1 0. -1.25 -1.17
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.299 | 0.298 | 0.643 | 992 | 0 | 8 | -18222 | 36463 | 36542 | 7709 | 18634 | 18643 |
From the above table, we know that room_type has a very significant impact on price_4_nights as adjusted R-squared for model 2 is more than 4 times the adjusted R-squared for model 1. Room type is also a categorical variable with 3 categories, and hence makes up 2 dummy variables in the regression model.
We notice that the t-stat values of several variables already present in model 1 have changed in model 2 (for House it rises from 11.3 to 17.7, while for Loft it falls from 5.46 to -1.56), indicating that there may be some multicollinearity between the variables. To check if that’s the case, we’ll calculate the VIF.
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.04 4 1.01
## number_of_reviews 1.01 1 1.01
## review_scores_rating 1.01 1 1.00
## room_type 1.04 2 1.01
None of the variables displays any sign of multicollinearity: every adjusted GVIF is close to 1.
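For intuition about what a VIF measures, here is a minimal base-R sketch on simulated data. It computes the plain VIF for numeric predictors as 1 / (1 - R²), where R² comes from regressing one predictor on the others. The GVIF table above looks like the output of a function such as `car::vif()`, but that is an assumption; the helper `vif_manual()` and the variables `x1`, `x2`, `x3` are hypothetical and do not handle multi-level factors.

```r
set.seed(13)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # deliberately correlated with x1
x3 <- rnorm(n)                        # independent of the others
sim <- data.frame(x1, x2, x3)

# VIF for one predictor: regress it on the remaining predictors
vif_manual <- function(data, target) {
  others <- setdiff(names(data), target)
  fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(names(sim), function(v) vif_manual(sim, v))
round(vifs, 2)
```

In this simulation the two correlated predictors get large VIFs while the independent one stays near 1, which is the pattern all of our model 2 variables show.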
# creating a huxtable for summary of two models
huxreg(model1, model2,
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma'),
bold_signif = 0.05,
stars = NULL
) %>%
set_caption('Comparison of Models 1.0')

| | (1) | (2) |
|---|---|---|
| (Intercept) | 6.906 | 7.116 |
| | (0.047) | (0.041) |
| prop_type_simplifiedCondominium | -0.061 | -0.034 |
| | (0.016) | (0.014) |
| prop_type_simplifiedHouse | 0.202 | 0.275 |
| | (0.018) | (0.016) |
| prop_type_simplifiedLoft | 0.106 | -0.027 |
| | (0.019) | (0.017) |
| prop_type_simplifiedOther | 0.453 | 0.528 |
| | (0.014) | (0.013) |
| number_of_reviews | -0.002 | -0.001 |
| | (0.000) | (0.000) |
| review_scores_rating | 0.005 | 0.005 |
| | (0.000) | (0.000) |
| room_typePrivate room | | -0.668 |
| | | (0.010) |
| room_typeShared room | | -1.210 |
| | | (0.022) |
| #observations | 18643 | 18643 |
| R squared | 0.071 | 0.299 |
| Adj. R Squared | 0.070 | 0.298 |
| Residual SE | 0.740 | 0.643 |
Previously, we plotted a correlation matrix to see which variables can be added to our regression model.
# model 3 = model 2 + beds, baths, bedrooms and no. of guests property can accommodate
model3 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms +
beds +
accommodates,
regression_data
)
model3 %>%
tidy(conf.int=TRUE)
## # A tibble: 13 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.72 0.0351 191. 0. 6.65e+0 6.79e+0
## 2 prop_type_simplif… -0.0376 0.0118 -3.20 1.38e- 3 -6.06e-2 -1.46e-2
## 3 prop_type_simplif… 0.120 0.0133 9.08 1.23e- 19 9.45e-2 1.47e-1
## 4 prop_type_simplif… -0.0626 0.0143 -4.36 1.29e- 5 -9.07e-2 -3.45e-2
## 5 prop_type_simplif… 0.261 0.0111 23.6 2.97e-121 2.39e-1 2.83e-1
## 6 number_of_reviews -0.000394 0.000190 -2.07 3.84e- 2 -7.66e-4 -2.10e-5
## 7 review_scores_rat… 0.00331 0.000364 9.10 1.03e- 19 2.60e-3 4.03e-3
## 8 room_typePrivate … -0.410 0.00946 -43.3 0. -4.28e-1 -3.91e-1
## 9 room_typeShared r… -0.914 0.0197 -46.4 0. -9.52e-1 -8.75e-1
## 10 bedrooms 0.0756 0.00684 11.0 2.89e- 28 6.22e-2 8.90e-2
## 11 bathrooms 0.0294 0.00405 7.27 3.66e- 13 2.15e-2 3.74e-2
## 12 beds -0.0330 0.00319 -10.3 5.30e- 25 -3.92e-2 -2.67e-2
## 13 accommodates 0.117 0.00290 40.3 0. 1.11e-1 1.22e-1
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.503 | 0.503 | 0.542 | 1565 | 0 | 12 | -14963 | 29954 | 30063 | 5447 | 18559 | 18572 |
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.15 4 1.02
## number_of_reviews 1.02 1 1.01
## review_scores_rating 1.01 1 1.01
## room_type 1.26 2 1.06
## bedrooms 4.39 1 2.10
## bathrooms 1.62 1 1.27
## beds 3.12 1 1.77
## accommodates 4.42 1 2.10
In the table above, we can see that the VIFs for bedrooms, beds and accommodates are higher than for the other variables. This is not a problem as such, since each is still below 5. A higher VIF is expected here because the more beds and bedrooms a property has, the more guests it can accommodate, so these variables are correlated with one another.
Does price of a property vary significantly if host is a Superhost?
Superhosts are experienced hosts who are most dedicated to providing outstanding hospitality to their guests. They need to maintain certain standards in response rate, cancellation rate and overall rating to earn this badge. From that perspective, we hypothesize that other factors remaining constant, a Superhost will charge prices higher than the average host. Let’s see if that’s true.
# model5 = model 4 + superhost status
model5 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms +
beds +
accommodates +
host_is_superhost,
regression_data
)
model5 %>%
tidy(conf.int=TRUE)
## # A tibble: 14 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.75e+0 0.0353 191. 0. 6.68 6.82
## 2 prop_type_simplifi… -3.95e-2 0.0117 -3.36 7.69e- 4 -0.0625 -0.0165
## 3 prop_type_simplifi… 1.22e-1 0.0133 9.20 3.79e- 20 0.0961 0.148
## 4 prop_type_simplifi… -6.55e-2 0.0143 -4.57 4.92e- 6 -0.0936 -0.0374
## 5 prop_type_simplifi… 2.63e-1 0.0111 23.8 6.65e-123 0.241 0.284
## 6 number_of_reviews -7.36e-4 0.000196 -3.76 1.70e- 4 -0.00112 -0.000352
## 7 review_scores_rati… 2.79e-3 0.000371 7.53 5.42e- 14 0.00206 0.00352
## 8 room_typePrivate r… -4.11e-1 0.00944 -43.5 0. -0.429 -0.392
## 9 room_typeShared ro… -9.09e-1 0.0197 -46.2 0. -0.948 -0.871
## 10 bedrooms 7.72e-2 0.00684 11.3 2.00e- 29 0.0638 0.0906
## 11 bathrooms 2.90e-2 0.00404 7.17 7.73e- 13 0.0211 0.0369
## 12 beds -3.29e-2 0.00318 -10.3 6.59e- 25 -0.0391 -0.0266
## 13 accommodates 1.16e-1 0.00289 40.2 0. 0.111 0.122
## 14 host_is_superhostT… 6.24e-2 0.00866 7.21 6.01e- 13 0.0454 0.0794
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.504 | 0.504 | 0.541 | 1453 | 0 | 13 | -14936 | 29902 | 30019 | 5432 | 18557 | 18571 |
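Our hypothesis seems to be true: host_is_superhost is significant as per its t-stat and p-value. Because the dependent variable is in logs, the coefficient of 0.0624 means one can expect the price of a Superhost’s property to be roughly 6.2% higher than that of an average host’s property (exp(0.0624) - 1 ≈ 6.4% exactly), other factors held constant.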
Is Location Exact?
Some hosts specify the exact location of their property; let’s see if that has any effect on the price for 4 nights.
model6 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms +
beds +
accommodates +
host_is_superhost +
is_location_exact,
regression_data
)
model6 %>%
tidy(conf.int=TRUE)
## # A tibble: 15 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.81e+0 0.0358 190. 0. 6.74 6.88
## 2 prop_type_simplifi… -4.01e-2 0.0117 -3.43 6.15e- 4 -0.0631 -0.0172
## 3 prop_type_simplifi… 1.09e-1 0.0133 8.23 1.92e- 16 0.0834 0.136
## 4 prop_type_simplifi… -6.36e-2 0.0143 -4.45 8.76e- 6 -0.0916 -0.0356
## 5 prop_type_simplifi… 2.49e-1 0.0111 22.4 3.28e-109 0.227 0.271
## 6 number_of_reviews -9.25e-4 0.000196 -4.71 2.49e- 6 -0.00131 -0.000540
## 7 review_scores_rati… 2.75e-3 0.000370 7.44 1.03e- 13 0.00203 0.00348
## 8 room_typePrivate r… -4.18e-1 0.00945 -44.2 0. -0.436 -0.399
## 9 room_typeShared ro… -9.13e-1 0.0196 -46.5 0. -0.951 -0.874
## 10 bedrooms 7.55e-2 0.00683 11.1 2.20e- 28 0.0622 0.0889
## 11 bathrooms 2.78e-2 0.00404 6.89 5.70e- 12 0.0199 0.0357
## 12 beds -3.22e-2 0.00318 -10.1 4.20e- 24 -0.0384 -0.0260
## 13 accommodates 1.15e-1 0.00289 39.9 0. 0.110 0.121
## 14 host_is_superhostT… 6.76e-2 0.00866 7.81 6.13e- 15 0.0506 0.0846
## 15 is_location_exactT… -7.74e-2 0.00825 -9.39 6.99e- 21 -0.0936 -0.0612
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.507 | 0.506 | 0.54 | 1362 | 0 | 14 | -14892 | 29816 | 29941 | 5406 | 18556 | 18571 |
The variable is_location_exact is significant as per its t-stat and p-value; however, the negative coefficient is surprising. Maybe that has less to do with whether the location specified is exact and more to do with what the location is!
For this purpose, let us include neighbourhood location into our regression model. To make things simple, we created a new variable called neighbourhood_simplified which groups different listings into broader categories or rings.
# Adding neighbourhood location
model7 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms +
beds +
accommodates +
host_is_superhost +
is_location_exact +
neighbourhood_simplified,
regression_data
)
model7 %>%
tidy(conf.int=TRUE)
## # A tibble: 19 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.95 0.0364 191. 0. 6.88 7.02
## 2 prop_type_simplifi… -0.0350 0.0115 -3.03 2.44e- 3 -0.0576 -0.0124
## 3 prop_type_simplifi… 0.0974 0.0132 7.38 1.60e- 13 0.0715 0.123
## 4 prop_type_simplifi… -0.0221 0.0146 -1.52 1.28e- 1 -0.0507 0.00638
## 5 prop_type_simplifi… 0.261 0.0114 23.0 5.70e-115 0.239 0.283
## 6 number_of_reviews -0.00173 0.000197 -8.80 1.53e- 18 -0.00211 -0.00134
## 7 review_scores_rati… 0.00316 0.000365 8.64 6.00e- 18 0.00244 0.00387
## 8 room_typePrivate r… -0.415 0.00935 -44.4 0. -0.434 -0.397
## 9 room_typeShared ro… -0.928 0.0195 -47.7 0. -0.966 -0.890
## 10 bedrooms 0.0922 0.00676 13.6 4.31e- 42 0.0789 0.105
## 11 bathrooms 0.0347 0.00399 8.72 3.11e- 18 0.0269 0.0426
## 12 beds -0.0329 0.00313 -10.5 8.66e- 26 -0.0390 -0.0268
## 13 accommodates 0.111 0.00286 38.8 4.53e-316 0.105 0.116
## 14 host_is_superhostT… 0.0635 0.00853 7.44 1.04e- 13 0.0468 0.0802
## 15 is_location_exactT… -0.0770 0.00816 -9.44 4.20e- 21 -0.0930 -0.0610
## 16 neighbourhood_simp… -0.198 0.0138 -14.3 2.41e- 46 -0.225 -0.171
## 17 neighbourhood_simp… -0.183 0.0124 -14.8 4.14e- 49 -0.207 -0.159
## 18 neighbourhood_simp… -0.205 0.0354 -5.81 6.52e- 9 -0.275 -0.136
## 19 neighbourhood_simp… -0.294 0.0123 -23.9 1.27e-124 -0.318 -0.270
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.522 | 0.521 | 0.532 | 1124 | 0 | 18 | -14609 | 29257 | 29414 | 5243 | 18552 | 18571 |
neighbourhood_simplified is a categorical variable with 5 categories, which correspond to 5 concentric rings - Ring 2, Ring 3, Ring 4, Ring 5 and Ring 6 - so it enters the model as 4 dummy variables, with Ring 2 as the baseline. Rings are similar to Zones in London, so Ring 2 is a more central location compared to Rings 3, 4, 5 or 6. We hypothesize that the more central the location of the property, the higher the price.
According to the coefficients of the neighbourhood_simplified ring dummies above, our hypothesis is broadly true. For the baseline, Ring 2, the intercept term is 6.95. The negative coefficients for Rings 3, 4, 5 and 6 indicate that the intercept term is lower by 0.198, 0.183, 0.205 and 0.294 respectively. So properties outside Ring 2 have a lower price_4_nights, and those in Ring 6, the outermost ring, are the cheapest, although the decline is not strictly monotonic (Ring 4 is slightly less discounted than Ring 3).
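Because every ring coefficient is measured against the baseline ring, it can help to see how the estimates change when the baseline changes. The sketch below uses simulated data; the offsets in `true_offset` are hypothetical values loosely inspired by the estimates above, and `relevel()` is the base-R function for choosing the reference level of a factor.

```r
set.seed(13)
rings <- paste("Ring", 2:6)

# Hypothetical log-price offsets per ring (NOT the fitted values)
true_offset <- c(0, -0.20, -0.18, -0.20, -0.29)
names(true_offset) <- rings

sim <- data.frame(ring = factor(sample(rings, 2000, replace = TRUE), levels = rings))
sim$log_price <- 7 + unname(true_offset[as.character(sim$ring)]) + rnorm(2000, sd = 0.1)

# Baseline Ring 2: coefficients are differences relative to the centre
fit_ring2 <- lm(log_price ~ ring, data = sim)

# Baseline Ring 6: the same fit, but differences are relative to the outskirts
sim$ring6_base <- relevel(sim$ring, ref = "Ring 6")
fit_ring6 <- lm(log_price ~ ring6_base, data = sim)

round(coef(fit_ring2), 3)
round(coef(fit_ring6), 3)
```

The Ring 6 coefficient in the first fit and the Ring 2 coefficient in the second are equal in size with opposite signs: the model is the same, only the reference point moves.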
With the inclusion of these location variables, our adjusted R-squared has increased to 0.521. Let’s continue to improve our model further. From the perspective of a host who sets prices in line with the time, money and effort spent managing the property, and of a traveller who books the Airbnb and pays that price, some other variables worth considering are the cancellation policy, cleanliness score, instant bookability, security deposit and host acceptance rate, each of which is tested in turn below.
model8 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms +
beds +
accommodates +
host_is_superhost +
is_location_exact +
neighbourhood_simplified +
cancellation_policy,
regression_data
)
model8 %>%
tidy(conf.int=TRUE)
## # A tibble: 21 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.92 0.0366 189. 0. 6.85 6.99
## 2 prop_type_simplifi… -0.0382 0.0115 -3.32 9.12e- 4 -0.0608 -0.0156
## 3 prop_type_simplifi… 0.0964 0.0132 7.32 2.65e- 13 0.0705 0.122
## 4 prop_type_simplifi… -0.0243 0.0145 -1.67 9.43e- 2 -0.0528 0.00417
## 5 prop_type_simplifi… 0.262 0.0114 23.1 4.17e-116 0.240 0.284
## 6 number_of_reviews -0.00188 0.000197 -9.54 1.58e- 21 -0.00227 -0.00150
## 7 review_scores_rati… 0.00312 0.000365 8.56 1.20e- 17 0.00241 0.00384
## 8 room_typePrivate r… -0.414 0.00934 -44.4 0. -0.432 -0.396
## 9 room_typeShared ro… -0.927 0.0195 -47.6 0. -0.965 -0.888
## 10 bedrooms 0.0928 0.00676 13.7 1.03e- 42 0.0795 0.106
## # … with 11 more rows
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.523 | 0.523 | 0.531 | 1017 | 0 | 20 | -14579 | 29203 | 29375 | 5227 | 18550 | 18571 |
model9 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms +
beds +
accommodates +
host_is_superhost +
is_location_exact +
neighbourhood_simplified +
review_scores_cleanliness,
regression_data
)
model9 %>%
tidy(conf.int=TRUE)
## # A tibble: 20 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.90 0.0388 178. 0. 6.82e+0 6.97
## 2 prop_type_simplif… -0.0351 0.0115 -3.04 2.37e- 3 -5.77e-2 -0.0125
## 3 prop_type_simplif… 0.0968 0.0132 7.34 2.17e- 13 7.10e-2 0.123
## 4 prop_type_simplif… -0.0227 0.0145 -1.56 1.19e- 1 -5.12e-2 0.00583
## 5 prop_type_simplif… 0.260 0.0114 22.8 6.82e-114 2.38e-1 0.282
## 6 number_of_reviews -0.00174 0.000197 -8.86 8.54e- 19 -2.13e-3 -0.00136
## 7 review_scores_rat… 0.00116 0.000587 1.98 4.78e- 2 1.14e-5 0.00231
## 8 room_typePrivate … -0.416 0.00934 -44.5 0. -4.34e-1 -0.397
## 9 room_typeShared r… -0.925 0.0195 -47.5 0. -9.63e-1 -0.887
## 10 bedrooms 0.0922 0.00676 13.6 4.04e- 42 7.89e-2 0.105
## 11 bathrooms 0.0348 0.00398 8.73 2.74e- 18 2.70e-2 0.0426
## 12 beds -0.0328 0.00313 -10.5 1.07e- 25 -3.90e-2 -0.0267
## 13 accommodates 0.111 0.00286 38.8 6.86e-317 1.05e-1 0.116
## 14 host_is_superhost… 0.0606 0.00855 7.09 1.38e- 12 4.39e-2 0.0774
## 15 is_location_exact… -0.0774 0.00815 -9.49 2.60e- 21 -9.33e-2 -0.0614
## 16 neighbourhood_sim… -0.199 0.0138 -14.4 9.13e- 47 -2.26e-1 -0.172
## 17 neighbourhood_sim… -0.183 0.0124 -14.8 1.98e- 49 -2.08e-1 -0.159
## 18 neighbourhood_sim… -0.206 0.0353 -5.82 5.95e- 9 -2.75e-1 -0.136
## 19 neighbourhood_sim… -0.296 0.0123 -24.1 2.15e-126 -3.21e-1 -0.272
## 20 review_scores_cle… 0.0259 0.00600 4.31 1.66e- 5 1.41e-2 0.0376
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.522 | 0.521 | 0.531 | 1066 | 0 | 19 | -14597 | 29236 | 29400 | 5237 | 18548 | 18568 |
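The cleanliness score is significant, but AIC and BIC are higher (worse) than in model 8, which uses the cancellation policy instead. Note that model 9 is fitted on 18,568 observations rather than 18,571, so the comparison is approximate.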
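For readers less familiar with these criteria, here is a minimal sketch of how AIC and BIC are built from the log-likelihood, using simulated data rather than our listings. The helper `ic_manual()` and the variables `x1`, `x2` and `y` are hypothetical; `AIC()`, `BIC()`, `logLik()` and `nobs()` are base-R functions.

```r
set.seed(13)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 0.5 * sim$x1 + rnorm(n)   # x2 has no true effect on y

fit_a <- lm(y ~ x1, data = sim)
fit_b <- lm(y ~ x1 + x2, data = sim)

# Manual AIC/BIC: k counts all coefficients plus the residual variance
ic_manual <- function(fit) {
  ll <- as.numeric(logLik(fit))
  k  <- length(coef(fit)) + 1
  c(AIC = -2 * ll + 2 * k, BIC = -2 * ll + log(nobs(fit)) * k)
}

rbind(a = ic_manual(fit_a), b = ic_manual(fit_b))

# The built-in functions give the same numbers
c(AIC(fit_a), BIC(fit_a))
```

Lower values are better for both criteria, and BIC penalises each extra parameter more heavily than AIC once the sample is larger than about 7 observations.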
# Add instant bookable
model10 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms +
beds +
accommodates +
host_is_superhost +
is_location_exact +
neighbourhood_simplified +
instant_bookable,
regression_data
)
model10 %>%
tidy(conf.int=TRUE)
## # A tibble: 20 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.95 0.0368 189. 0. 6.88 7.02
## 2 prop_type_simplifi… -0.0350 0.0115 -3.03 2.43e- 3 -0.0576 -0.0124
## 3 prop_type_simplifi… 0.0974 0.0132 7.38 1.63e- 13 0.0715 0.123
## 4 prop_type_simplifi… -0.0222 0.0146 -1.52 1.28e- 1 -0.0507 0.00637
## 5 prop_type_simplifi… 0.261 0.0114 22.9 8.24e-115 0.239 0.283
## 6 number_of_reviews -0.00173 0.000197 -8.78 1.76e- 18 -0.00211 -0.00134
## 7 review_scores_rati… 0.00316 0.000365 8.64 5.97e- 18 0.00244 0.00387
## 8 room_typePrivate r… -0.415 0.00936 -44.4 0. -0.434 -0.397
## 9 room_typeShared ro… -0.928 0.0195 -47.6 0. -0.966 -0.890
## 10 bedrooms 0.0922 0.00676 13.6 4.30e- 42 0.0789 0.105
## 11 bathrooms 0.0347 0.00399 8.71 3.16e- 18 0.0269 0.0426
## 12 beds -0.0329 0.00313 -10.5 8.62e- 26 -0.0390 -0.0268
## 13 accommodates 0.111 0.00286 38.8 5.13e-316 0.105 0.116
## 14 host_is_superhostT… 0.0634 0.00854 7.43 1.16e- 13 0.0467 0.0802
## 15 is_location_exactT… -0.0771 0.00817 -9.43 4.75e- 21 -0.0931 -0.0610
## 16 neighbourhood_simp… -0.198 0.0138 -14.3 2.68e- 46 -0.225 -0.171
## 17 neighbourhood_simp… -0.183 0.0124 -14.8 4.13e- 49 -0.207 -0.159
## 18 neighbourhood_simp… -0.205 0.0354 -5.81 6.52e- 9 -0.275 -0.136
## 19 neighbourhood_simp… -0.294 0.0123 -23.9 4.30e-124 -0.318 -0.270
## 20 instant_bookableTR… 0.00109 0.00841 0.129 8.97e- 1 -0.0154 0.0176
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.522 | 0.521 | 0.532 | 1064 | 0 | 19 | -14609 | 29259 | 29423 | 5243 | 18551 | 18571 |
instant_bookable has a t-statistic of 0.129 (p-value 0.897), well below the usual threshold of about 2, and is therefore not significant.
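The link between the t-statistic, the threshold and the p-value can be checked by hand. The sketch below takes the t-statistic and residual degrees of freedom from the model 10 output and uses the base-R functions `pt()` and `qt()`; the variable names are ours.

```r
t_stat <- 0.129      # statistic for instant_bookableTRUE in model 10
df_res <- 18551      # df.residual from the model 10 summary row

p_value    <- 2 * pt(-abs(t_stat), df = df_res)  # two-sided p-value
t_critical <- qt(0.975, df = df_res)             # 5% two-sided cut-off

c(p_value = p_value, t_critical = t_critical)
```

With this many degrees of freedom the cut-off is essentially the normal value of about 1.96, which is where the rule of thumb "t-stat above 2" comes from.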
# using security deposit in levels (untransformed) here
model11 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms +
beds +
accommodates +
host_is_superhost +
is_location_exact +
neighbourhood_simplified +
security_deposit,
regression_data
)
model11 %>%
tidy(conf.int=TRUE)
## # A tibble: 20 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.95e+0 0.0363 191. 0. 6.88e+0 7.02e+0
## 2 prop_type_simpli… -3.60e-2 0.0115 -3.13 1.75e- 3 -5.86e-2 -1.35e-2
## 3 prop_type_simpli… 9.74e-2 0.0132 7.40 1.43e- 13 7.16e-2 1.23e-1
## 4 prop_type_simpli… -2.14e-2 0.0145 -1.47 1.41e- 1 -4.98e-2 7.11e-3
## 5 prop_type_simpli… 2.60e-1 0.0114 22.9 9.30e-115 2.38e-1 2.83e-1
## 6 number_of_reviews -1.84e-3 0.000196 -9.34 1.04e- 20 -2.22e-3 -1.45e-3
## 7 review_scores_ra… 3.11e-3 0.000365 8.53 1.52e- 17 2.40e-3 3.83e-3
## 8 room_typePrivate… -4.12e-1 0.00933 -44.2 0. -4.31e-1 -3.94e-1
## 9 room_typeShared … -9.23e-1 0.0194 -47.5 0. -9.61e-1 -8.85e-1
## 10 bedrooms 9.19e-2 0.00675 13.6 4.86e- 42 7.87e-2 1.05e-1
## 11 bathrooms 3.47e-2 0.00398 8.72 3.03e- 18 2.69e-2 4.25e-2
## 12 beds -3.25e-2 0.00312 -10.4 2.88e- 25 -3.86e-2 -2.64e-2
## 13 accommodates 1.10e-1 0.00285 38.7 8.38e-315 1.05e-1 1.16e-1
## 14 host_is_superhos… 6.22e-2 0.00851 7.31 2.84e- 13 4.55e-2 7.89e-2
## 15 is_location_exac… -7.44e-2 0.00814 -9.14 6.72e- 20 -9.04e-2 -5.85e-2
## 16 neighbourhood_si… -1.99e-1 0.0138 -14.4 6.03e- 47 -2.26e-1 -1.72e-1
## 17 neighbourhood_si… -1.85e-1 0.0123 -15.0 2.64e- 50 -2.09e-1 -1.60e-1
## 18 neighbourhood_si… -2.07e-1 0.0353 -5.86 4.84e- 9 -2.76e-1 -1.37e-1
## 19 neighbourhood_si… -2.92e-1 0.0123 -23.8 8.47e-124 -3.17e-1 -2.68e-1
## 20 security_deposit 2.38e-5 0.00000257 9.24 2.77e- 20 1.87e-5 2.88e-5
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.524 | 0.523 | 0.53 | 1074 | 0 | 19 | -14566 | 29174 | 29338 | 5219 | 18551 | 18571 |
# using log of security deposit instead as it is a highly skewed variable
model12 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms +
beds +
accommodates +
host_is_superhost +
is_location_exact +
neighbourhood_simplified +
log(security_deposit + 0.001),
regression_data
)
model12 %>%
tidy(conf.int=TRUE)
## # A tibble: 20 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.99 0.0364 192. 0. 6.92 7.06
## 2 prop_type_simplifi… -0.0377 0.0115 -3.28 1.04e- 3 -0.0602 -0.0152
## 3 prop_type_simplifi… 0.101 0.0131 7.71 1.36e- 14 0.0755 0.127
## 4 prop_type_simplifi… -0.0247 0.0145 -1.71 8.79e- 2 -0.0531 0.00367
## 5 prop_type_simplifi… 0.265 0.0113 23.4 5.01e-119 0.243 0.287
## 6 number_of_reviews -0.00191 0.000196 -9.75 2.14e- 22 -0.00230 -0.00153
## 7 review_scores_rati… 0.00296 0.000364 8.13 4.59e- 16 0.00225 0.00367
## 8 room_typePrivate r… -0.406 0.00934 -43.4 0. -0.424 -0.388
## 9 room_typeShared ro… -0.908 0.0195 -46.7 0. -0.946 -0.870
## 10 bedrooms 0.0912 0.00674 13.5 1.61e- 41 0.0780 0.104
## 11 bathrooms 0.0344 0.00397 8.67 4.64e- 18 0.0266 0.0422
## 12 beds -0.0323 0.00312 -10.4 4.52e- 25 -0.0384 -0.0262
## 13 accommodates 0.110 0.00285 38.7 4.48e-315 0.105 0.116
## 14 host_is_superhostT… 0.0572 0.00851 6.72 1.88e- 11 0.0405 0.0739
## 15 is_location_exactT… -0.0708 0.00814 -8.70 3.63e- 18 -0.0868 -0.0548
## 16 neighbourhood_simp… -0.196 0.0137 -14.3 5.61e- 46 -0.223 -0.169
## 17 neighbourhood_simp… -0.185 0.0123 -15.0 7.79e- 51 -0.210 -0.161
## 18 neighbourhood_simp… -0.210 0.0352 -5.97 2.48e- 9 -0.279 -0.141
## 19 neighbourhood_simp… -0.287 0.0123 -23.5 7.17e-120 -0.311 -0.263
## 20 log(security_depos… 0.00817 0.000666 12.3 1.77e- 34 0.00686 0.00947
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.525 | 0.525 | 0.53 | 1081 | 0 | 19 | -14533 | 29109 | 29273 | 5201 | 18551 | 18571 |
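The small constant added inside the log in model 12 is there because some listings have a zero deposit. The sketch below shows the issue on a hypothetical vector of deposits; `log1p()` is mentioned only as a common alternative, not as what we used.

```r
deposit <- c(0, 0, 500, 2000, 10000)   # hypothetical deposits, including zeros

raw_log    <- log(deposit)           # -Inf for zero deposits, which lm() cannot handle
offset_log <- log(deposit + 0.001)   # the offset used in model 12: finite everywhere
log1p_val  <- log1p(deposit)         # alternative: log(1 + x), zeros map to exactly 0

data.frame(deposit, raw_log, offset_log, log1p_val)
```

One design consideration: with an offset of 0.001, zero deposits land at about -6.9 on the log scale, far from the positive deposits, so the size of the offset can influence the fitted coefficient.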
# host acceptance rate
model12.5 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms +
beds +
accommodates +
host_is_superhost +
is_location_exact +
neighbourhood_simplified +
host_acceptance_rate,
regression_data
)
model12.5 %>%
tidy(conf.int=TRUE)
## # A tibble: 20 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.95 0.0384 181. 0. 6.87 7.02
## 2 prop_type_simplifi… -0.0350 0.0115 -3.04 2.41e- 3 -0.0576 -0.0124
## 3 prop_type_simplifi… 0.0975 0.0132 7.39 1.50e- 13 0.0717 0.123
## 4 prop_type_simplifi… -0.0224 0.0146 -1.54 1.24e- 1 -0.0509 0.00617
## 5 prop_type_simplifi… 0.261 0.0114 22.9 1.08e-114 0.239 0.283
## 6 number_of_reviews -0.00173 0.000197 -8.81 1.35e- 18 -0.00212 -0.00135
## 7 review_scores_rati… 0.00316 0.000366 8.65 5.40e- 18 0.00245 0.00388
## 8 room_typePrivate r… -0.415 0.00939 -44.2 0. -0.433 -0.397
## 9 room_typeShared ro… -0.927 0.0195 -47.5 0. -0.966 -0.889
## 10 bedrooms 0.0923 0.00677 13.6 3.83e- 42 0.0790 0.106
## 11 bathrooms 0.0347 0.00399 8.71 3.21e- 18 0.0269 0.0426
## 12 beds -0.0329 0.00313 -10.5 8.35e- 26 -0.0391 -0.0268
## 13 accommodates 0.111 0.00286 38.7 2.59e-315 0.105 0.116
## 14 host_is_superhostT… 0.0625 0.00875 7.14 9.65e- 13 0.0453 0.0796
## 15 is_location_exactT… -0.0776 0.00825 -9.41 5.75e- 21 -0.0938 -0.0615
## 16 neighbourhood_simp… -0.198 0.0138 -14.3 2.59e- 46 -0.225 -0.171
## 17 neighbourhood_simp… -0.183 0.0124 -14.8 4.91e- 49 -0.207 -0.158
## 18 neighbourhood_simp… -0.205 0.0354 -5.80 6.80e- 9 -0.274 -0.136
## 19 neighbourhood_simp… -0.294 0.0123 -23.9 1.14e-124 -0.318 -0.270
## 20 host_acceptance_ra… 0.00742 0.0145 0.513 6.08e- 1 -0.0209 0.0357
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.522 | 0.521 | 0.532 | 1064 | 0 | 19 | -14608 | 29259 | 29423 | 5243 | 18551 | 18571 |
# summary table to compare last few models
huxreg(model8, model9, model10, model11, model12, model12.5,
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma'),
bold_signif = 0.05,
stars = NULL
) %>%
set_caption('Comparison of Models 2.0')

| | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
| (Intercept) | 6.923 | 6.897 | 6.951 | 6.950 | 6.993 | 6.945 |
| | (0.037) | (0.039) | (0.037) | (0.036) | (0.036) | (0.038) |
| prop_type_simplifiedCondominium | -0.038 | -0.035 | -0.035 | -0.036 | -0.038 | -0.035 |
| | (0.012) | (0.012) | (0.012) | (0.012) | (0.011) | (0.012) |
| prop_type_simplifiedHouse | 0.096 | 0.097 | 0.097 | 0.097 | 0.101 | 0.098 |
| | (0.013) | (0.013) | (0.013) | (0.013) | (0.013) | (0.013) |
| prop_type_simplifiedLoft | -0.024 | -0.023 | -0.022 | -0.021 | -0.025 | -0.022 |
| | (0.015) | (0.015) | (0.015) | (0.015) | (0.014) | (0.015) |
| prop_type_simplifiedOther | 0.262 | 0.260 | 0.261 | 0.260 | 0.265 | 0.261 |
| | (0.011) | (0.011) | (0.011) | (0.011) | (0.011) | (0.011) |
| number_of_reviews | -0.002 | -0.002 | -0.002 | -0.002 | -0.002 | -0.002 |
| | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) |
| review_scores_rating | 0.003 | 0.001 | 0.003 | 0.003 | 0.003 | 0.003 |
| | (0.000) | (0.001) | (0.000) | (0.000) | (0.000) | (0.000) |
| room_typePrivate room | -0.414 | -0.416 | -0.415 | -0.412 | -0.406 | -0.415 |
| | (0.009) | (0.009) | (0.009) | (0.009) | (0.009) | (0.009) |
| room_typeShared room | -0.927 | -0.925 | -0.928 | -0.923 | -0.908 | -0.927 |
| | (0.019) | (0.019) | (0.020) | (0.019) | (0.019) | (0.020) |
| bedrooms | 0.093 | 0.092 | 0.092 | 0.092 | 0.091 | 0.092 |
| | (0.007) | (0.007) | (0.007) | (0.007) | (0.007) | (0.007) |
| bathrooms | 0.034 | 0.035 | 0.035 | 0.035 | 0.034 | 0.035 |
| | (0.004) | (0.004) | (0.004) | (0.004) | (0.004) | (0.004) |
| beds | -0.032 | -0.033 | -0.033 | -0.032 | -0.032 | -0.033 |
| | (0.003) | (0.003) | (0.003) | (0.003) | (0.003) | (0.003) |
| accommodates | 0.109 | 0.111 | 0.111 | 0.110 | 0.110 | 0.111 |
| | (0.003) | (0.003) | (0.003) | (0.003) | (0.003) | (0.003) |
| host_is_superhostTRUE | 0.055 | 0.061 | 0.063 | 0.062 | 0.057 | 0.062 |
| | (0.009) | (0.009) | (0.009) | (0.009) | (0.009) | (0.009) |
| is_location_exactTRUE | -0.075 | -0.077 | -0.077 | -0.074 | -0.071 | -0.078 |
| | (0.008) | (0.008) | (0.008) | (0.008) | (0.008) | (0.008) |
| neighbourhood_simplifiedRing 3 | -0.194 | -0.199 | -0.198 | -0.199 | -0.196 | -0.198 |
| | (0.014) | (0.014) | (0.014) | (0.014) | (0.014) | (0.014) |
| neighbourhood_simplifiedRing 4 | -0.180 | -0.183 | -0.183 | -0.185 | -0.185 | -0.183 |
| | (0.012) | (0.012) | (0.012) | (0.012) | (0.012) | (0.012) |
| neighbourhood_simplifiedRing 5 | -0.203 | -0.206 | -0.205 | -0.207 | -0.210 | -0.205 |
| | (0.035) | (0.035) | (0.035) | (0.035) | (0.035) | (0.035) |
| neighbourhood_simplifiedRing 6 | -0.285 | -0.296 | -0.294 | -0.292 | -0.287 | -0.294 |
| | (0.012) | (0.012) | (0.012) | (0.012) | (0.012) | (0.012) |
| cancellation_policymoderate | 0.055 | | | | | |
| | (0.009) | | | | | |
| cancellation_policystrict_14_with_grace_period | 0.070 | | | | | |
| | (0.010) | | | | | |
| review_scores_cleanliness | | 0.026 | | | | |
| | | (0.006) | | | | |
| instant_bookableTRUE | | | 0.001 | | | |
| | | | (0.008) | | | |
| security_deposit | | | | 0.000 | | |
| | | | | (0.000) | | |
| log(security_deposit + 0.001) | | | | | 0.008 | |
| | | | | | (0.001) | |
| host_acceptance_rate | | | | | | 0.007 |
| | | | | | | (0.014) |
| #observations | 18571 | 18568 | 18571 | 18571 | 18571 | 18571 |
| R squared | 0.523 | 0.522 | 0.522 | 0.524 | 0.525 | 0.522 |
| Adj. R Squared | 0.523 | 0.521 | 0.521 | 0.523 | 0.525 | 0.521 |
| Residual SE | 0.531 | 0.531 | 0.532 | 0.530 | 0.530 | 0.532 |
On the basis of the models above, we will select the variables which improve the model, log(security_deposit) for example, and exclude the insignificant ones such as host_acceptance_rate.
# amenities - try three models for two amenities - Wifi and Breakfast
#just wifi
model13 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms +
beds +
accommodates +
host_is_superhost +
is_location_exact +
neighbourhood_simplified +
wifi,
regression_data
)
model13 %>%
tidy(conf.int=TRUE)
## # A tibble: 20 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.84 0.0436 157. 0. 6.76 6.93
## 2 prop_type_simplifi… -0.0359 0.0115 -3.11 1.89e- 3 -0.0585 -0.0132
## 3 prop_type_simplifi… 0.0973 0.0132 7.38 1.64e- 13 0.0715 0.123
## 4 prop_type_simplifi… -0.0234 0.0145 -1.61 1.07e- 1 -0.0519 0.00510
## 5 prop_type_simplifi… 0.261 0.0114 23.0 4.42e-115 0.239 0.283
## 6 number_of_reviews -0.00176 0.000197 -8.96 3.42e- 19 -0.00215 -0.00138
## 7 review_scores_rati… 0.00309 0.000365 8.47 2.65e- 17 0.00238 0.00381
## 8 room_typePrivate r… -0.416 0.00934 -44.5 0. -0.434 -0.398
## 9 room_typeShared ro… -0.927 0.0195 -47.6 0. -0.966 -0.889
## 10 bedrooms 0.0925 0.00676 13.7 2.15e- 42 0.0792 0.106
## 11 bathrooms 0.0347 0.00398 8.70 3.55e- 18 0.0269 0.0425
## 12 beds -0.0330 0.00313 -10.5 6.25e- 26 -0.0391 -0.0269
## 13 accommodates 0.110 0.00286 38.7 7.68e-315 0.105 0.116
## 14 host_is_superhostT… 0.0625 0.00853 7.32 2.49e- 13 0.0458 0.0792
## 15 is_location_exactT… -0.0766 0.00815 -9.40 6.06e- 21 -0.0926 -0.0607
## 16 neighbourhood_simp… -0.196 0.0138 -14.2 1.01e- 45 -0.223 -0.169
## 17 neighbourhood_simp… -0.182 0.0124 -14.8 5.51e- 49 -0.207 -0.158
## 18 neighbourhood_simp… -0.205 0.0353 -5.81 6.31e- 9 -0.275 -0.136
## 19 neighbourhood_simp… -0.293 0.0123 -23.8 1.58e-123 -0.317 -0.269
## 20 wifiTRUE 0.118 0.0262 4.50 6.99e- 6 0.0663 0.169
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.522 | 0.522 | 0.531 | 1067 | 0 | 19 | -14598 | 29239 | 29403 | 5238 | 18551 | 18571 |
#just breakfast
model14 <- lm(price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms +
beds +
accommodates +
host_is_superhost +
is_location_exact +
neighbourhood_simplified +
breakfast,
regression_data
)
model14 %>%
tidy(conf.int=TRUE)
## # A tibble: 20 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.97 0.0361 193. 0. 6.90 7.04
## 2 prop_type_simplifi… -0.0292 0.0114 -2.55 1.07e- 2 -0.0516 -0.00677
## 3 prop_type_simplifi… 0.0931 0.0131 7.14 9.93e- 13 0.0676 0.119
## 4 prop_type_simplifi… -0.0129 0.0144 -0.892 3.72e- 1 -0.0411 0.0154
## 5 prop_type_simplifi… 0.224 0.0114 19.6 1.41e- 84 0.201 0.246
## 6 number_of_reviews -0.00178 0.000195 -9.17 5.06e- 20 -0.00217 -0.00140
## 7 review_scores_rati… 0.00303 0.000362 8.38 5.85e- 17 0.00232 0.00374
## 8 room_typePrivate r… -0.443 0.00935 -47.3 0. -0.461 -0.424
## 9 room_typeShared ro… -0.952 0.0193 -49.3 0. -0.990 -0.914
## 10 bedrooms 0.0907 0.00669 13.5 1.37e- 41 0.0775 0.104
## 11 bathrooms 0.0312 0.00395 7.91 2.69e- 15 0.0235 0.0390
## 12 beds -0.0330 0.00310 -10.7 1.95e- 26 -0.0391 -0.0269
## 13 accommodates 0.110 0.00283 38.8 1.26e-316 0.104 0.115
## 14 host_is_superhostT… 0.0636 0.00844 7.53 5.20e- 14 0.0470 0.0801
## 15 is_location_exactT… -0.0686 0.00808 -8.49 2.13e- 17 -0.0845 -0.0528
## 16 neighbourhood_simp… -0.196 0.0137 -14.4 1.51e- 46 -0.223 -0.169
## 17 neighbourhood_simp… -0.182 0.0122 -14.9 6.20e- 50 -0.206 -0.158
## 18 neighbourhood_simp… -0.205 0.0350 -5.87 4.54e- 9 -0.274 -0.137
## 19 neighbourhood_simp… -0.316 0.0122 -25.9 3.22e-145 -0.340 -0.292
## 20 breakfastTRUE 0.266 0.0134 19.9 6.03e- 87 0.239 0.292
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.532 | 0.531 | 0.526 | 1108 | 0 | 19 | -14413 | 28868 | 29032 | 5134 | 18551 | 18571 |
# both wifi and breakfast
model15 <- lm(price_4_nights ~
                prop_type_simplified +
                number_of_reviews +
                review_scores_rating +
                room_type +
                bedrooms +
                bathrooms +
                beds +
                accommodates +
                host_is_superhost +
                is_location_exact +
                neighbourhood_simplified +
                wifi +
                breakfast,
              data = regression_data
)
model15 %>%
  tidy(conf.int = TRUE)
## # A tibble: 21 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.87 0.0432 159. 0. 6.79 6.96
## 2 prop_type_simplifie… -0.0300 0.0114 -2.62 8.70e- 3 -0.0524 -0.00758
## 3 prop_type_simplifie… 0.0931 0.0130 7.13 1.00e-12 0.0675 0.119
## 4 prop_type_simplifie… -0.0140 0.0144 -0.974 3.30e- 1 -0.0423 0.0142
## 5 prop_type_simplifie… 0.224 0.0114 19.6 8.65e-85 0.201 0.246
## 6 number_of_reviews -0.00181 0.000195 -9.32 1.29e-20 -0.00220 -0.00143
## 7 review_scores_rating 0.00297 0.000362 8.22 2.10e-16 0.00227 0.00368
## 8 room_typePrivate ro… -0.443 0.00935 -47.4 0. -0.461 -0.425
## 9 room_typeShared room -0.951 0.0193 -49.3 0. -0.989 -0.914
## 10 bedrooms 0.0909 0.00669 13.6 7.35e-42 0.0778 0.104
## # … with 11 more rows
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.532 | 0.531 | 0.526 | 1054 | 0 | 20 | -14405 | 28854 | 29026 | 5130 | 18550 | 18571 |
# count of amenities
model16 <- lm(price_4_nights ~
                prop_type_simplified +
                number_of_reviews +
                review_scores_rating +
                room_type +
                bedrooms +
                bathrooms +
                beds +
                accommodates +
                host_is_superhost +
                is_location_exact +
                neighbourhood_simplified +
                services,
              data = regression_data
)
model16 %>%
  tidy(conf.int = TRUE)
## # A tibble: 20 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.84 0.0364 188. 0. 6.77 6.91
## 2 prop_type_simplifi… -0.0426 0.0114 -3.73 1.90e- 4 -0.0650 -0.0202
## 3 prop_type_simplifi… 0.100 0.0130 7.67 1.80e- 14 0.0744 0.126
## 4 prop_type_simplifi… -0.0249 0.0144 -1.73 8.29e- 2 -0.0531 0.00325
## 5 prop_type_simplifi… 0.247 0.0113 21.9 2.54e-105 0.225 0.269
## 6 number_of_reviews -0.00264 0.000199 -13.3 5.98e- 40 -0.00303 -0.00225
## 7 review_scores_rati… 0.00269 0.000362 7.45 9.90e- 14 0.00198 0.00340
## 8 room_typePrivate r… -0.409 0.00924 -44.2 0. -0.427 -0.391
## 9 room_typeShared ro… -0.893 0.0193 -46.2 0. -0.931 -0.855
## 10 bedrooms 0.0929 0.00668 13.9 1.03e- 43 0.0798 0.106
## 11 bathrooms 0.0303 0.00395 7.67 1.78e- 14 0.0225 0.0380
## 12 beds -0.0319 0.00309 -10.3 6.33e- 25 -0.0380 -0.0259
## 13 accommodates 0.106 0.00283 37.5 3.15e-296 0.101 0.112
## 14 host_is_superhostT… 0.0298 0.00858 3.47 5.25e- 4 0.0129 0.0466
## 15 is_location_exactT… -0.0686 0.00807 -8.50 1.98e- 17 -0.0844 -0.0528
## 16 neighbourhood_simp… -0.209 0.0137 -15.3 1.04e- 52 -0.236 -0.182
## 17 neighbourhood_simp… -0.199 0.0123 -16.3 3.69e- 59 -0.223 -0.175
## 18 neighbourhood_simp… -0.223 0.0350 -6.38 1.83e- 10 -0.291 -0.154
## 19 neighbourhood_simp… -0.305 0.0122 -25.0 3.84e-136 -0.328 -0.281
## 20 services 0.00871 0.000413 21.1 1.98e- 97 0.00790 0.00952
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.533 | 0.532 | 0.525 | 1113 | 0 | 19 | -14389 | 28820 | 28984 | 5121 | 18551 | 18571 |
# log of number of amenities
# (the small offset avoids taking log(0) when services is 0)
model17 <- lm(price_4_nights ~
                prop_type_simplified +
                number_of_reviews +
                review_scores_rating +
                room_type +
                bedrooms +
                bathrooms +
                beds +
                accommodates +
                host_is_superhost +
                is_location_exact +
                neighbourhood_simplified +
                log(services + 0.000001),
              data = regression_data
)
model17 %>%
  tidy(conf.int = TRUE)
## # A tibble: 20 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.43 0.0442 146. 0. 6.34 6.51
## 2 prop_type_simplifi… -0.0453 0.0114 -3.97 7.27e- 5 -0.0677 -0.0229
## 3 prop_type_simplifi… 0.100 0.0130 7.69 1.49e- 14 0.0748 0.126
## 4 prop_type_simplifi… -0.0286 0.0144 -1.99 4.68e- 2 -0.0568 -0.000400
## 5 prop_type_simplifi… 0.248 0.0113 22.0 2.31e-106 0.226 0.271
## 6 number_of_reviews -0.00260 0.000199 -13.1 7.39e- 39 -0.00299 -0.00221
## 7 review_scores_rati… 0.00265 0.000362 7.31 2.78e- 13 0.00194 0.00336
## 8 room_typePrivate r… -0.407 0.00925 -44.0 0. -0.426 -0.389
## 9 room_typeShared ro… -0.886 0.0194 -45.8 0. -0.924 -0.848
## 10 bedrooms 0.0922 0.00669 13.8 5.17e- 43 0.0791 0.105
## 11 bathrooms 0.0307 0.00395 7.79 7.02e- 15 0.0230 0.0385
## 12 beds -0.0315 0.00310 -10.2 2.90e- 24 -0.0376 -0.0254
## 13 accommodates 0.107 0.00283 37.7 1.18e-299 0.101 0.112
## 14 host_is_superhostT… 0.0330 0.00856 3.85 1.17e- 4 0.0162 0.0498
## 15 is_location_exactT… -0.0677 0.00808 -8.38 5.90e- 17 -0.0835 -0.0518
## 16 neighbourhood_simp… -0.210 0.0137 -15.4 4.00e- 53 -0.237 -0.183
## 17 neighbourhood_simp… -0.202 0.0123 -16.4 3.09e- 60 -0.226 -0.178
## 18 neighbourhood_simp… -0.224 0.0350 -6.40 1.54e- 10 -0.293 -0.155
## 19 neighbourhood_simp… -0.307 0.0122 -25.2 3.30e-138 -0.331 -0.283
## 20 log(services + 1e-… 0.202 0.00982 20.5 1.25e- 92 0.182 0.221
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.532 | 0.532 | 0.526 | 1111 | 0 | 19 | -14400 | 28842 | 29006 | 5127 | 18551 | 18571 |
# summary table to compare last few models
huxreg(model13, model14, model15, model16, model17,
       statistics = c('#observations' = 'nobs',
                      'R squared' = 'r.squared',
                      'Adj. R Squared' = 'adj.r.squared',
                      'Residual SE' = 'sigma'),
       bold_signif = 0.05,
       stars = NULL
) %>%
  set_caption('Comparison of Models 3.0')
| | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|
| (Intercept) | 6.843 | 6.966 | 6.871 | 6.840 | 6.427 |
| | (0.044) | (0.036) | (0.043) | (0.036) | (0.044) |
| prop_type_simplifiedCondominium | -0.036 | -0.029 | -0.030 | -0.043 | -0.045 |
| | (0.012) | (0.011) | (0.011) | (0.011) | (0.011) |
| prop_type_simplifiedHouse | 0.097 | 0.093 | 0.093 | 0.100 | 0.100 |
| | (0.013) | (0.013) | (0.013) | (0.013) | (0.013) |
| prop_type_simplifiedLoft | -0.023 | -0.013 | -0.014 | -0.025 | -0.029 |
| | (0.015) | (0.014) | (0.014) | (0.014) | (0.014) |
| prop_type_simplifiedOther | 0.261 | 0.224 | 0.224 | 0.247 | 0.248 |
| | (0.011) | (0.011) | (0.011) | (0.011) | (0.011) |
| number_of_reviews | -0.002 | -0.002 | -0.002 | -0.003 | -0.003 |
| | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) |
| review_scores_rating | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 |
| | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) |
| room_typePrivate room | -0.416 | -0.443 | -0.443 | -0.409 | -0.407 |
| | (0.009) | (0.009) | (0.009) | (0.009) | (0.009) |
| room_typeShared room | -0.927 | -0.952 | -0.951 | -0.893 | -0.886 |
| | (0.019) | (0.019) | (0.019) | (0.019) | (0.019) |
| bedrooms | 0.092 | 0.091 | 0.091 | 0.093 | 0.092 |
| | (0.007) | (0.007) | (0.007) | (0.007) | (0.007) |
| bathrooms | 0.035 | 0.031 | 0.031 | 0.030 | 0.031 |
| | (0.004) | (0.004) | (0.004) | (0.004) | (0.004) |
| beds | -0.033 | -0.033 | -0.033 | -0.032 | -0.032 |
| | (0.003) | (0.003) | (0.003) | (0.003) | (0.003) |
| accommodates | 0.110 | 0.110 | 0.109 | 0.106 | 0.107 |
| | (0.003) | (0.003) | (0.003) | (0.003) | (0.003) |
| host_is_superhostTRUE | 0.062 | 0.064 | 0.063 | 0.030 | 0.033 |
| | (0.009) | (0.008) | (0.008) | (0.009) | (0.009) |
| is_location_exactTRUE | -0.077 | -0.069 | -0.068 | -0.069 | -0.068 |
| | (0.008) | (0.008) | (0.008) | (0.008) | (0.008) |
| neighbourhood_simplifiedRing 3 | -0.196 | -0.196 | -0.195 | -0.209 | -0.210 |
| | (0.014) | (0.014) | (0.014) | (0.014) | (0.014) |
| neighbourhood_simplifiedRing 4 | -0.182 | -0.182 | -0.182 | -0.199 | -0.202 |
| | (0.012) | (0.012) | (0.012) | (0.012) | (0.012) |
| neighbourhood_simplifiedRing 5 | -0.205 | -0.205 | -0.205 | -0.223 | -0.224 |
| | (0.035) | (0.035) | (0.035) | (0.035) | (0.035) |
| neighbourhood_simplifiedRing 6 | -0.293 | -0.316 | -0.315 | -0.305 | -0.307 |
| | (0.012) | (0.012) | (0.012) | (0.012) | (0.012) |
| wifiTRUE | 0.118 | | 0.104 | | |
| | (0.026) | | (0.026) | | |
| breakfastTRUE | | 0.266 | 0.264 | | |
| | | (0.013) | (0.013) | | |
| services | | | | 0.009 | |
| | | | | (0.000) | |
| log(services + 1e-06) | | | | | 0.202 |
| | | | | | (0.010) |
| #observations | 18571 | 18571 | 18571 | 18571 | 18571 |
| R squared | 0.522 | 0.532 | 0.532 | 0.533 | 0.532 |
| Adj. R Squared | 0.522 | 0.531 | 0.531 | 0.532 | 0.532 |
| Residual SE | 0.531 | 0.526 | 0.526 | 0.525 | 0.526 |
Treating the number of amenities as a single numeric variable, services, explains slightly more of the variation in the regressand (adjusted R-squared of 0.532) than wifi alone (0.522) or breakfast alone (0.531). However, we will still include ‘wifi’ and ‘breakfast’ in the final model, as these are two of the most important amenities people look for when booking an Airbnb.
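The comparison above can also be tabulated directly rather than read off several separate outputs. The following is a minimal sketch in base R on simulated data; the data frame `toy`, its coefficients and the three candidate formulas are hypothetical stand-ins and not our `regression_data`.

```r
# Simulated stand-in for regression_data (hypothetical values, not the Beijing listings)
set.seed(13)
n <- 500
toy <- data.frame(
  accommodates = sample(1:6, n, replace = TRUE),
  wifi         = runif(n) < 0.9,
  breakfast    = runif(n) < 0.2,
  services     = rpois(n, 15)
)
toy$log_price <- 6.5 + 0.10 * toy$accommodates + 0.05 * toy$wifi +
  0.25 * toy$breakfast + 0.01 * toy$services + rnorm(n, sd = 0.5)

# Candidate specifications that differ only in how amenities enter the model
candidates <- list(
  wifi_only      = lm(log_price ~ accommodates + wifi, data = toy),
  breakfast_only = lm(log_price ~ accommodates + breakfast, data = toy),
  services_count = lm(log_price ~ accommodates + services, data = toy)
)

# Collect the statistics we used to judge each candidate
comparison <- data.frame(
  model         = names(candidates),
  adj_r_squared = sapply(candidates, function(m) summary(m)$adj.r.squared),
  residual_se   = sapply(candidates, function(m) summary(m)$sigma),
  aic           = sapply(candidates, AIC),
  row.names     = NULL
)

# Lower AIC and residual SE, and higher adjusted R-squared, indicate a better fit
comparison[order(comparison$aic), ]
```

Because all candidates are fitted to the same observations, their AIC values are directly comparable; this would not hold if missing values caused the samples to differ.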
final_model <- lm(price_4_nights ~
                    prop_type_simplified +
                    number_of_reviews +
                    review_scores_rating +
                    room_type +
                    bedrooms +
                    beds +
                    bathrooms +
                    accommodates +
                    host_is_superhost +
                    is_location_exact +
                    neighbourhood_simplified +
                    cancellation_policy +
                    log(security_deposit + 0.001) +
                    wifi +
                    breakfast +
                    services,
                  data = regression_data
)
final_model %>%
  tidy(conf.int = TRUE)
## # A tibble: 25 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.83 0.0431 159. 0. 6.75 6.92
## 2 prop_type_simplifie… -0.0410 0.0113 -3.63 2.88e- 4 -0.0631 -0.0188
## 3 prop_type_simplifie… 0.0980 0.0129 7.61 2.95e-14 0.0728 0.123
## 4 prop_type_simplifie… -0.0207 0.0142 -1.46 1.45e- 1 -0.0486 0.00717
## 5 prop_type_simplifie… 0.220 0.0113 19.5 3.50e-84 0.198 0.243
## 6 number_of_reviews -0.00277 0.000198 -14.0 1.70e-44 -0.00316 -0.00239
## 7 review_scores_rating 0.00247 0.000358 6.91 5.17e-12 0.00177 0.00317
## 8 room_typePrivate ro… -0.426 0.00927 -45.9 0. -0.444 -0.408
## 9 room_typeShared room -0.905 0.0192 -47.1 0. -0.942 -0.867
## 10 bedrooms 0.0914 0.00661 13.8 2.71e-43 0.0785 0.104
## # … with 15 more rows
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.544 | 0.543 | 0.519 | 921 | 0 | 24 | -14169 | 28391 | 28594 | 5001 | 18546 | 18571 |
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.42 4 1.04
## number_of_reviews 1.20 1 1.10
## review_scores_rating 1.07 1 1.03
## room_type 1.35 2 1.08
## bedrooms 4.46 1 2.11
## beds 3.12 1 1.77
## bathrooms 1.64 1 1.28
## accommodates 4.52 1 2.13
## host_is_superhost 1.19 1 1.09
## is_location_exact 1.09 1 1.04
## neighbourhood_simplified 1.41 4 1.04
## cancellation_policy 1.12 2 1.03
## log(security_deposit + 0.001) 1.09 1 1.04
## wifi 1.02 1 1.01
## breakfast 1.18 1 1.08
## services 1.25 1 1.12
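For terms with a single degree of freedom, the GVIF reported above is the ordinary variance inflation factor, 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below reproduces that calculation by hand in base R on simulated data; the variables in `toy` are hypothetical and only mimic the correlation we would expect between `bedrooms` and `accommodates`.

```r
# Simulated predictors (hypothetical): bedrooms is built to track accommodates
set.seed(13)
n <- 1000
accommodates <- sample(1:8, n, replace = TRUE)
bedrooms     <- pmax(1, round(accommodates / 2 + rnorm(n, sd = 0.5)))
bathrooms    <- pmax(1, round(1 + rnorm(n, sd = 0.5)))
toy <- data.frame(accommodates, bedrooms, bathrooms)

# VIF for each predictor: regress it on the remaining predictors
vif_by_hand <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})

round(vif_by_hand, 2)
```

A common rule of thumb treats values above 5 as a sign of problematic collinearity; on that reading, the values of about 4.5 for `bedrooms` and `accommodates` in our table are elevated but still acceptable.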
## Estimate Std. Error t value
## (Intercept) 6.834929 0.043110 158.55
## prop_type_simplifiedCondominium -0.040955 0.011292 -3.63
## prop_type_simplifiedHouse 0.098044 0.012890 7.61
## prop_type_simplifiedLoft -0.020720 0.014230 -1.46
## prop_type_simplifiedOther 0.220428 0.011281 19.54
## number_of_reviews -0.002774 0.000198 -14.03
## review_scores_rating 0.002472 0.000358 6.91
## room_typePrivate room -0.426002 0.009272 -45.94
## room_typeShared room -0.904553 0.019201 -47.11
## bedrooms 0.091438 0.006611 13.83
## beds -0.031420 0.003058 -10.27
## bathrooms 0.027585 0.003903 7.07
## accommodates 0.104229 0.002808 37.12
## host_is_superhostTRUE 0.025002 0.008554 2.92
## is_location_exactTRUE -0.056128 0.008001 -7.01
## neighbourhood_simplifiedRing 3 -0.201454 0.013509 -14.91
## neighbourhood_simplifiedRing 4 -0.196317 0.012120 -16.20
## neighbourhood_simplifiedRing 5 -0.222180 0.034552 -6.43
## neighbourhood_simplifiedRing 6 -0.309583 0.012135 -25.51
## cancellation_policymoderate 0.031338 0.009109 3.44
## cancellation_policystrict_14_with_grace_period 0.056111 0.010155 5.53
## log(security_deposit + 0.001) 0.006432 0.000665 9.67
## wifiTRUE 0.058051 0.025706 2.26
## breakfastTRUE 0.232856 0.013348 17.45
## services 0.007012 0.000419 16.74
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## prop_type_simplifiedCondominium 0.00029 ***
## prop_type_simplifiedHouse 3.0e-14 ***
## prop_type_simplifiedLoft 0.14540
## prop_type_simplifiedOther < 2e-16 ***
## number_of_reviews < 2e-16 ***
## review_scores_rating 5.2e-12 ***
## room_typePrivate room < 2e-16 ***
## room_typeShared room < 2e-16 ***
## bedrooms < 2e-16 ***
## beds < 2e-16 ***
## bathrooms 1.6e-12 ***
## accommodates < 2e-16 ***
## host_is_superhostTRUE 0.00347 **
## is_location_exactTRUE 2.4e-12 ***
## neighbourhood_simplifiedRing 3 < 2e-16 ***
## neighbourhood_simplifiedRing 4 < 2e-16 ***
## neighbourhood_simplifiedRing 5 1.3e-10 ***
## neighbourhood_simplifiedRing 6 < 2e-16 ***
## cancellation_policymoderate 0.00058 ***
## cancellation_policystrict_14_with_grace_period 3.3e-08 ***
## log(security_deposit + 0.001) < 2e-16 ***
## wifiTRUE 0.02394 *
## breakfastTRUE < 2e-16 ***
## services < 2e-16 ***
##
## Residual standard error: 0.519 on 18546 degrees of freedom
## (14926 observations deleted due to missingness)
## Multiple R-squared: 0.544, Adjusted R-squared: 0.543
## F-statistic: 921 on 24 and 18546 DF, p-value: <2e-16
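Assuming, as the later `exp()` back-transformation suggests, that `price_4_nights` is on the natural-log scale, each coefficient can be read as an approximate percentage change in price via 100 * (exp(b) - 1). The short sketch below applies this to three estimates copied from the summary above; the vector name `coefs` is ours.

```r
# Estimates copied from the final_model summary above
coefs <- c(
  "breakfastTRUE"                  = 0.232856,
  "room_typePrivate room"          = -0.426002,
  "neighbourhood_simplifiedRing 6" = -0.309583
)

# Percentage change in price relative to the reference category,
# holding the other regressors fixed
pct_change <- 100 * (exp(coefs) - 1)
round(pct_change, 1)
```

Under this reading, offering breakfast is associated with a price roughly 26% higher, while a private room is roughly 35% cheaper than the reference room type, all else equal.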
## # A tibble: 18,571 x 23
## .rownames price_4_nights prop_type_simpl… number_of_revie… review_scores_r…
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 12673 12.3 House 1 100
## 2 3135 12.4 Condominium 1 60
## 3 1360 12.0 Apartment 2 100
## 4 8288 12.5 Apartment 1 60
## 5 13739 11.6 Apartment 1 80
## 6 1355 12.5 Other 16 86
## 7 9017 12.5 Condominium 1 80
## 8 12477 11.7 Apartment 35 97
## 9 17536 12.1 Condominium 6 83
## 10 12907 11.3 Apartment 25 100
## # … with 18,561 more rows, and 18 more variables: room_type <chr>,
## # bedrooms <dbl>, beds <dbl>, bathrooms <dbl>, accommodates <dbl>,
## # host_is_superhost <lgl>, is_location_exact <lgl>,
## # neighbourhood_simplified <chr>, cancellation_policy <chr>,
## # `log(security_deposit + 0.001)` <dbl>, wifi <lgl>, breakfast <lgl>,
## # services <int>, .fitted <dbl>, .std.resid <dbl>, .hat <dbl>, .sigma <dbl>,
## # .cooksd <dbl>
Residuals v Fitted: Residuals are random, do not follow any obvious pattern, and are centered around Y = 0. As a result, our linearity assumption holds TRUE.
Normal Q-Q: There are substantial deviations from the straight line, indicating that residuals may not follow a normal distribution. As a result, our normality assumption may not hold TRUE.
Scale-Location: There are no apparent positive or negative trends across the fitted values, indicating that variability is constant. Therefore, our Equal Variance assumption holds TRUE.
Residuals v Leverage: There appear to be several influential points, some with high leverage and some with large absolute residuals. These may have undue influence on the estimates of the model parameters.
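Influential points like those seen in the Residuals v Leverage plot can also be listed numerically with Cook's distance. The following is a minimal sketch in base R on simulated data with one planted outlier; the 4 / n cut-off is a common rule of thumb that we assume here, not a threshold used elsewhere in this report.

```r
# Simulated data (hypothetical) with one planted high-leverage outlier in row 1
set.seed(13)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 2 * toy$x + rnorm(n)
toy$x[1] <- 8
toy$y[1] <- -10

fit <- lm(y ~ x, data = toy)

# Cook's distance combines leverage and residual size for each observation
cooks     <- cooks.distance(fit)
threshold <- 4 / n
flagged   <- which(cooks > threshold)

# The most influential observations, largest first
head(sort(cooks[flagged], decreasing = TRUE))
```

Refitting the model without the flagged rows and comparing coefficients is one way to check how much the estimates depend on them.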
price_4_nights For An Imaginary Airbnb
# here is an imaginary Airbnb
imaginary_airbnb <- tibble(prop_type_simplified = "Apartment",
                           room_type = "Private room",
                           number_of_reviews = 10,
                           review_scores_rating = 90,
                           beds = 1,
                           bathrooms = 1,
                           bedrooms = 1,
                           accommodates = 2,
                           neighbourhood_simplified = "Ring 5",
                           cancellation_policy = "flexible",
                           host_is_superhost = FALSE,
                           is_location_exact = TRUE,
                           security_deposit = 0,
                           services = 15,
                           wifi = TRUE,
                           breakfast = FALSE
)
imaginary_airbnb
## # A tibble: 1 x 16
## prop_type_simpl… room_type number_of_revie… review_scores_r… beds bathrooms
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Apartment Private … 10 90 1 1
## # … with 10 more variables: bedrooms <dbl>, accommodates <dbl>,
## # neighbourhood_simplified <chr>, cancellation_policy <chr>,
## # host_is_superhost <lgl>, is_location_exact <lgl>, security_deposit <dbl>,
## # services <dbl>, wifi <lgl>, breakfast <lgl>
# use broom::augment() to predict the price for this imaginary Airbnb
lets_predict <- broom::augment(final_model,
                               newdata = imaginary_airbnb,
                               se_fit = TRUE)
# calculate the lower and upper bounds of the 95% confidence interval
lets_predict <- lets_predict %>%
  mutate(
    lower_ci = .fitted - 1.96 * .se.fit,
    upper_ci = .fitted + 1.96 * .se.fit
  )
lets_predict
## # A tibble: 1 x 20
## prop_type_simpl… room_type number_of_revie… review_scores_r… beds bathrooms
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Apartment Private … 10 90 1 1
## # … with 14 more variables: bedrooms <dbl>, accommodates <dbl>,
## # neighbourhood_simplified <chr>, cancellation_policy <chr>,
## # host_is_superhost <lgl>, is_location_exact <lgl>, security_deposit <dbl>,
## # services <dbl>, wifi <lgl>, breakfast <lgl>, .fitted <dbl>, .se.fit <dbl>,
## # lower_ci <dbl>, upper_ci <dbl>
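# viewing our results
view_final <- lets_predict %>%
  select(c(lower_ci,
           .fitted,
           upper_ci,
           .se.fit)
  ) %>%
  # exponentiate to move from the log scale back to price units;
  # note that exp(.se.fit) is a multiplicative factor, not a standard error in price units
  mutate(
    lower_ci = exp(lower_ci),
    upper_ci = exp(upper_ci),
    .fitted = exp(.fitted),
    .se.fit = exp(.se.fit)
  )
view_final
## # A tibble: 1 x 4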
## lower_ci .fitted upper_ci .se.fit
## <dbl> <dbl> <dbl> <dbl>
## 1 791. 846. 905. 1.03
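The interval above is built from `.se.fit`, so it describes uncertainty about the average price of listings with these characteristics; the range for a single new listing is wider because it also includes residual variation. The sketch below contrasts the two using base R's `predict()` on simulated data; the model and the data frame `toy` are hypothetical stand-ins for `final_model` and `regression_data`.

```r
# Simulated log-price data (hypothetical values)
set.seed(13)
n <- 300
toy <- data.frame(accommodates = sample(1:6, n, replace = TRUE))
toy$log_price <- 6 + 0.1 * toy$accommodates + rnorm(n, sd = 0.5)

fit <- lm(log_price ~ accommodates, data = toy)
new_listing <- data.frame(accommodates = 2)

# Confidence interval: uncertainty about the mean log price
conf_int <- predict(fit, newdata = new_listing,
                    interval = "confidence", level = 0.95)
# Prediction interval: uncertainty about one individual listing
pred_int <- predict(fit, newdata = new_listing,
                    interval = "prediction", level = 0.95)

# Back-transform both intervals from the log scale to price units
exp(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
```

With a residual standard error of about 0.52 on the log scale, a prediction interval for one of our imaginary listings would likely be considerably wider than the confidence interval reported above.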
## # A tibble: 5 x 16
## prop_type_simpl… room_type number_of_revie… review_scores_r… beds bathrooms
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Apartment Private … 35 99 1 1
## 2 Apartment Private … 35 99 1 1
## 3 Apartment Private … 35 99 1 1
## 4 Apartment Private … 35 99 1 1
## 5 Apartment Private … 35 99 1 1
## # … with 10 more variables: bedrooms <dbl>, accommodates <dbl>,
## # neighbourhood_simplified <chr>, cancellation_policy <chr>,
## # host_is_superhost <lgl>, is_location_exact <lgl>, security_deposit <dbl>,
## # services <dbl>, wifi <lgl>, breakfast <lgl>
## # A tibble: 5 x 20
## prop_type_simpl… room_type number_of_revie… review_scores_r… beds bathrooms
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Apartment Private … 35 99 1 1
## 2 Apartment Private … 35 99 1 1
## 3 Apartment Private … 35 99 1 1
## 4 Apartment Private … 35 99 1 1
## 5 Apartment Private … 35 99 1 1
## # … with 14 more variables: bedrooms <dbl>, accommodates <dbl>,
## # neighbourhood_simplified <chr>, cancellation_policy <chr>,
## # host_is_superhost <lgl>, is_location_exact <lgl>, security_deposit <dbl>,
## # services <dbl>, wifi <lgl>, breakfast <lgl>, .fitted <dbl>, .se.fit <dbl>,
## # lower_ci <dbl>, upper_ci <dbl>
## # A tibble: 5 x 4
## lower_ci .fitted upper_ci .se.fit
## <dbl> <dbl> <dbl> <dbl>
## 1 1433. 1491. 1550. 1.02
## 2 1136. 1173. 1210. 1.02
## 3 877. 939. 1006. 1.04
## 4 823. 882. 945. 1.04
## 5 753. 807. 864. 1.04
The table above gives our predicted prices for a 4 night stay at Airbnbs in Beijing with different characteristics. One can notice the difference in prices when characteristics such as location and breakfast are changed.
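The code that built the set of varied listings is not shown in this section. One way to generate such scenarios is to cross a fixed base listing with the characteristics to vary, as in the base R sketch below; the chosen levels and the shortened column set are illustrative assumptions, not the five listings reported above.

```r
# Characteristics held constant across scenarios (illustrative subset of columns)
base_listing <- data.frame(
  prop_type_simplified = "Apartment",
  room_type            = "Private room",
  accommodates         = 2,
  stringsAsFactors     = FALSE
)

# Characteristics to vary: every combination of location and breakfast
variations <- expand.grid(
  neighbourhood_simplified = c("Ring 3", "Ring 6"),
  breakfast                = c(FALSE, TRUE),
  stringsAsFactors         = FALSE
)

# With no shared columns, merge() returns the Cartesian product (a cross join)
scenarios <- merge(base_listing, variations)
scenarios
```

A table like `scenarios`, extended with the remaining regressors, could then be passed as `newdata` to the same prediction step used for the single imaginary listing.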
In the model above, we have used variables in linear or log form. In reality, however, variations in price cannot be explained by a linear regression model alone. We therefore believe the explanatory power of our model could be improved further, but the methods required are outside the scope of this project.
This analytic project mainly exercises the use of:
Library: corrplot, dplyr, GGally, huxtable, patchwork, kableExtra, car, readr, rsample, ggridges, ggfortify, stringr.
Function: lm, autoplot, augment, kbl, tidy, ggpairs.